In the landscape of anomaly detection, density-based methods stand out as one of the most intuitive and powerful paradigms. At their core, these methods are built upon a simple yet profound observation: anomalies tend to reside in regions of low data density, while normal instances cluster together in high-density regions.
This observation aligns with our intuitive understanding of outliers. When you visualize a scatter plot of normal data, points tend to congregate—forming clusters, following patterns, adhering to underlying distributions. Anomalies, by contrast, appear as lonely points, isolated from the crowd, occupying sparse regions of the feature space.
However, translating this intuition into a rigorous mathematical framework requires careful consideration. What exactly do we mean by 'density'? How do we measure it? And critically, how do we account for datasets where different regions naturally have different densities—where what appears sparse in one context is perfectly normal in another?
This page establishes the mathematical foundations of density-based anomaly detection. You will understand: (1) Formal definitions of density in the context of point clouds; (2) Why absolute density fails and relative density succeeds; (3) The mathematical frameworks for computing relative density; (4) How relative density connects to probabilistic interpretations; and (5) The key design decisions when implementing density-based anomaly detectors.
Before we can detect anomalies using density, we must first rigorously define what density means for discrete data points. Unlike continuous probability distributions where density functions are well-defined, working with finite samples requires careful construction.
Given a dataset $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$, we want to assign a density value $\rho(\mathbf{x})$ to any point $\mathbf{x}$ in the feature space. The key question is: how do we estimate this density from our finite sample?
Two primary approaches dominate the literature:
1. Distance-Based Density: Measure density inversely proportional to distances to neighboring points. Regions where points are close together have high density; regions where points are far apart have low density.
2. Volume-Based Density: Count the number of points within a fixed volume. Regions containing many points have high density; regions with few points have low density.
Both approaches are valid, but they lead to different algorithms with different properties. Let's examine each in detail.
It's crucial to distinguish between density estimation (covered in Chapter 22) and density-based anomaly detection. Density estimation aims to recover the true underlying probability density function $p(\mathbf{x})$. Density-based anomaly detection uses density estimates to identify outliers, but doesn't necessarily require recovering the true density—only a quantity that correctly orders points by their 'normalcy'. This subtle distinction matters when selecting methods.
The most common distance-based density measures involve the k-nearest neighbors of a point. Let $d_k(\mathbf{x})$ denote the distance from point $\mathbf{x}$ to its $k$-th nearest neighbor in the dataset.
Definition (k-Distance Density): $$\rho_k(\mathbf{x}) = \frac{1}{d_k(\mathbf{x})}$$
This simple definition captures the intuition that points in dense regions have nearby neighbors (small $d_k$, hence large $\rho_k$), while points in sparse regions have distant neighbors (large $d_k$, hence small $\rho_k$).
Definition (Average k-Distance Density): $$\rho_{\text{avg},k}(\mathbf{x}) = \frac{k}{\sum_{i=1}^{k} d_i(\mathbf{x})}$$
where $d_i(\mathbf{x})$ is the distance to the $i$-th nearest neighbor. This definition is more robust as it averages over all k neighbors rather than depending solely on the k-th.
Definition (Harmonic Mean Distance Density): $$\rho_{\text{harm},k}(\mathbf{x}) = \left( \frac{k}{\sum_{i=1}^{k} 1/d_i(\mathbf{x})} \right)^{-1} = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{d_i(\mathbf{x})}$$
The harmonic mean is dominated by the smallest distances, so its inverse is particularly sensitive to the immediate neighborhood structure—unlike the arithmetic mean in the previous definition, which weights all k distances equally.
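To see this concretely, here is a minimal sketch with made-up neighbor distances: a single very close neighbor inflates the harmonic-mean-based density far more than the arithmetic-mean-based one.

```python
import numpy as np

# Toy neighbor distances for a single query point: one very close neighbor,
# several farther ones (these values are made up for illustration).
dists = np.array([0.05, 1.0, 1.2, 1.5, 2.0])
k = len(dists)

rho_avg = k / dists.sum()            # inverse of the arithmetic mean distance
rho_harm = (1.0 / dists).sum() / k   # inverse of the harmonic mean distance

print(f"arithmetic-mean density: {rho_avg:.3f}")   # ~0.87, barely notices the close neighbor
print(f"harmonic-mean density:   {rho_harm:.3f}")  # ~4.6, dominated by the 0.05 distance
```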
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def compute_distance_based_densities(X: np.ndarray, k: int = 5) -> dict:
    """
    Compute various distance-based density measures for all points.

    Parameters
    ----------
    X : np.ndarray of shape (n_samples, n_features)
        Input data points
    k : int
        Number of neighbors to consider

    Returns
    -------
    dict : Dictionary containing different density measures
    """
    # Fit k-NN model (k + 1 because each query point is its own nearest neighbor)
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm='auto')
    nn.fit(X)

    # Get distances to the k nearest neighbors (exclude self at index 0)
    distances, _ = nn.kneighbors(X)
    distances = distances[:, 1:]

    # k-distance density: 1 / distance to the k-th neighbor
    k_distance = distances[:, -1]
    rho_k = 1.0 / (k_distance + 1e-10)  # epsilon avoids division by zero

    # Average k-distance density: k / sum of distances to the k neighbors
    sum_distances = distances.sum(axis=1)
    rho_avg = k / (sum_distances + 1e-10)

    # Harmonic-mean distance density: mean of inverse distances,
    # i.e. the inverse of the harmonic mean of the k distances
    rho_harm = (1.0 / (distances + 1e-10)).sum(axis=1) / k

    # Local reachability density (used in LOF) is a more sophisticated
    # density definition covered later in the module.
    return {
        'k_distance_density': rho_k,
        'avg_distance_density': rho_avg,
        'harmonic_distance_density': rho_harm,
        'k_distances': k_distance,
        'all_distances': distances
    }

# Example usage
np.random.seed(42)

# Generate clustered normal data with some outliers
normal_data = np.random.randn(100, 2) * 0.5 + np.array([0, 0])
outliers = np.array([[3, 3], [-3, 3], [0, -4]])
X = np.vstack([normal_data, outliers])

densities = compute_distance_based_densities(X, k=5)
print("Density statistics (normal points):",
      f"mean={densities['k_distance_density'][:100].mean():.4f}")
print("Density statistics (outliers):",
      f"mean={densities['k_distance_density'][100:].mean():.4f}")
```

Volume-based approaches flip the perspective: instead of measuring distances, we fix a volume and count points within it.
Definition (ε-Neighborhood Density): $$\rho_\varepsilon(\mathbf{x}) = \frac{|N_\varepsilon(\mathbf{x})|}{V(\varepsilon)}$$
where $N_\varepsilon(\mathbf{x}) = \{\mathbf{x}_i \in \mathcal{D} : d(\mathbf{x}, \mathbf{x}_i) \leq \varepsilon\}$ is the ε-neighborhood, and $V(\varepsilon)$ is the volume of a ball of radius ε in d dimensions.
For a d-dimensional Euclidean ball: $$V(\varepsilon) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \varepsilon^d$$
Because of the $\varepsilon^d$ factor, this volume depends exponentially on the dimension, making neighborhood counts extremely sensitive to the choice of ε in high dimensions—a fact that becomes critically important when we discuss scalability and the curse of dimensionality.
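As a quick sanity check on how that volume behaves, the sketch below evaluates the formula for a few dimensions and radii (the specific values of d and ε are arbitrary):

```python
import numpy as np
from scipy.special import gamma

def ball_volume(eps: float, d: int) -> float:
    """Volume of a d-dimensional Euclidean ball of radius eps."""
    return (np.pi ** (d / 2) / gamma(d / 2 + 1)) * eps ** d

for d in (2, 5, 10, 50):
    print(f"d={d:3d}  V(eps=0.5)={ball_volume(0.5, d):.3e}  V(eps=2.0)={ball_volume(2.0, d):.3e}")
# The eps**d factor makes the count-per-volume estimate extremely sensitive
# to eps as the dimension grows, which is the practical face of the curse
# of dimensionality for epsilon-neighborhood density.
```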
Definition (Kernel Density Estimate): $$\hat{p}(\mathbf{x}) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$
where $K$ is a kernel function (often Gaussian) and $h$ is the bandwidth. This provides a smooth density estimate that can serve as the basis for anomaly detection.
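A minimal sketch of KDE-based scoring with scikit-learn's KernelDensity; the bandwidth of 0.5, the toy data, and the 5% flagging rate are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),   # dense cluster
               [[4.0, 4.0], [-3.0, 3.5]]])          # two obvious outliers

# Fit a Gaussian KDE; score_samples returns log p_hat(x)
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)

# Flag the ~5% of points with the lowest estimated density
threshold = np.quantile(log_density, 0.05)
flagged = np.where(log_density < threshold)[0]
print("Flagged indices:", flagged)
print("Lowest log-densities:", np.sort(log_density)[:5])
```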
With formal density definitions in hand, one might imagine that anomaly detection is straightforward: simply identify points with low absolute density. This approach fails spectacularly in practice. Understanding why leads us to the crucial concept of relative density.
Real-world datasets frequently contain regions with naturally different densities. Consider these examples:
Geographic data: Urban areas have high population density; rural areas have low density. A point in a rural area isn't anomalous—it's just in a sparse region.
Financial transactions: High-frequency trading generates dense clusters of small transactions; large institutional trades are infrequent but normal.
Network traffic: Burst patterns create dense clusters; idle periods create sparse regions. Both are normal behavior.
Sensor readings: Some sensors report frequently; others report rarely. Low reporting frequency doesn't indicate malfunction.
In all these cases, absolute density fails as an anomaly criterion because it cannot distinguish between a normal point that simply lives in a naturally sparse region and a genuine anomaly lurking just outside a dense cluster.
Using a single global density threshold—marking all points below threshold as anomalies—is one of the most common mistakes in anomaly detection. If the threshold is set based on the average density, you'll flag all points in naturally sparse regions while missing true anomalies in dense regions. This is why algorithms like LOF (Local Outlier Factor) were developed.
The solution is to compare each point's density not to a global threshold, but to the density of its neighbors. This leads to the concept of relative density or local density ratio.
Definition (Relative Density): $$\text{RelDensity}(\mathbf{x}) = \frac{\rho(\mathbf{x})}{\frac{1}{k}\sum_{\mathbf{y} \in N_k(\mathbf{x})} \rho(\mathbf{y})}$$
This ratio compares the density of point $\mathbf{x}$ to the average density of its k-nearest neighbors. The interpretation is elegant: a value near 1 means the point is roughly as dense as its neighborhood (normal); a value well below 1 means the point is far less dense than its surroundings (candidate anomaly); a value above 1 means the point sits in a locally denser spot than its neighbors.
The Local Outlier Factor, introduced by Breunig et al. (2000), is the most influential formalization of relative density for anomaly detection. We'll cover LOF in depth in Module 3b, but the key insight is relevant here:
$$\text{LOF}_k(\mathbf{x}) = \frac{1}{k} \sum_{\mathbf{y} \in N_k(\mathbf{x})} \frac{\text{lrd}_k(\mathbf{y})}{\text{lrd}_k(\mathbf{x})}$$
where lrd is the local reachability density. Notice this is the inverse of relative density—high LOF indicates low relative density, hence anomaly. This inversion is a convention to make higher scores correspond to greater 'outlierness'.
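As a preview, here is a minimal sketch using scikit-learn's LocalOutlierFactor on toy data (the n_neighbors value and the dataset are arbitrary); the algorithm itself is covered in Module 3b:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),   # dense cluster
               rng.normal(5, 1.5, size=(60, 2)),    # sparser cluster
               [[2.5, 2.5]]])                       # point between the clusters

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 for outliers, 1 for inliers

# sklearn exposes the *negative* LOF; more negative means more anomalous
scores = -lof.negative_outlier_factor_
print("LOF of the bridge point:", scores[-1].round(2))
print("Median LOF of cluster points:", np.median(scores[:-1]).round(2))
```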
| Scenario | Absolute Density | Relative Density | Correct Label |
|---|---|---|---|
| Dense cluster, normal point | High | ≈ 1 | Normal ✓ (both correct) |
| Sparse cluster, normal point | Low | ≈ 1 | Normal (only relative correct) |
| Dense cluster, local outlier | Medium | << 1 | Anomaly (only relative correct) |
| Sparse cluster, local outlier | Very Low | << 1 | Anomaly ✓ (both correct) |
| Between two clusters (bridge) | Medium | Variable | Context-dependent |
Relative density possesses several important mathematical properties:
1. Scale Invariance: Relative density is invariant to uniform scaling of the data. If all distances are multiplied by a constant, individual densities change but their ratios remain constant.
2. Translation Invariance: Shifting the entire dataset doesn't affect relative densities, as the spatial relationships between points are preserved.
3. Local Adaptivity: The 'reference density' adapts to the local neighborhood, allowing the method to handle datasets with varying density regions.
4. Sensitivity to k: The choice of k controls the scope of 'locality'. Small k makes the measure sensitive to immediate neighbors; large k provides more stable but less local estimates.
$$\lim_{k \to n} \text{RelDensity}(\mathbf{x}) = \frac{\rho(\mathbf{x})}{\bar{\rho}}$$
As k approaches n (dataset size), relative density converges to the ratio of local density to global average density, recovering a global measure.
Understanding relative density geometrically and probabilistically deepens our intuition and guides algorithmic choices.
One powerful geometric interpretation involves Voronoi cells. The Voronoi cell of a point $\mathbf{x}_i$ is the region of space closer to $\mathbf{x}_i$ than to any other point:
$$V_i = \{\mathbf{x} \in \mathbb{R}^d : d(\mathbf{x}, \mathbf{x}_i) < d(\mathbf{x}, \mathbf{x}_j) \;\; \forall j \neq i\}$$
The volume of the Voronoi cell is inversely related to density: points in dense regions are hemmed in by nearby neighbors and occupy small cells, while points in sparse regions claim large cells.
For anomaly detection, we can define: $$\rho_{\text{Voronoi}}(\mathbf{x}_i) \propto \frac{1}{\text{Vol}(V_i)}$$
While Voronoi-based density is theoretically elegant, computing exact Voronoi cells in high dimensions is computationally expensive, which is why k-NN approximations dominate practical applications.
There's a deep connection between Voronoi diagrams and nearest neighbor methods. The k-nearest neighbor graph is, in a sense, an approximation to the local structure captured by Voronoi/Delaunay tessellations. The Delaunay triangulation (dual of Voronoi) connects points that are 'natural neighbors', which often coincides with k-NN for appropriate k.
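For low-dimensional data the Voronoi view is easy to try directly. The sketch below uses SciPy to compute the area of each bounded Voronoi cell in 2D and treats its inverse as a density proxy; unbounded cells on the convex hull are skipped, which is itself a small boundary-effect caveat:

```python
import numpy as np
from scipy.spatial import Voronoi, ConvexHull

rng = np.random.default_rng(2)
points = np.vstack([rng.normal(0, 0.4, size=(80, 2)),   # cluster A around (0, 0)
                    rng.normal(4, 0.4, size=(80, 2)),   # cluster B around (4, 4)
                    [[2.0, 2.0]]])                      # sparse point between them

vor = Voronoi(points)
densities = np.full(len(points), np.nan)

for i, region_idx in enumerate(vor.point_region):
    region = vor.regions[region_idx]
    if len(region) == 0 or -1 in region:
        continue                                        # unbounded or empty cell: skip
    cell_vertices = vor.vertices[region]
    area = ConvexHull(cell_vertices).volume             # .volume is the area in 2D
    densities[i] = 1.0 / area                           # Voronoi density proxy

print("Median bounded-cell density:", round(float(np.nanmedian(densities)), 2))
# NaN here would mean the bridge point's cell happened to be unbounded
print("Density of the bridge point: ", densities[-1])
```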
From a probabilistic perspective, density-based anomaly detection can be viewed as estimating the likelihood of observing a point under the underlying data-generating distribution.
If data is drawn from distribution $p(\mathbf{x})$, then: $$P(\text{anomaly} | \mathbf{x}) \propto \frac{1}{p(\mathbf{x})}$$
Points in low-density regions have low probability under the data-generating process, hence are likely anomalous. This connects density-based methods to the broader field of probabilistic anomaly detection.
The Relative Probability View:
Relative density can be interpreted as comparing the local probability to a neighborhood-averaged probability:
$$\text{RelDensity}(\mathbf{x}) \approx \frac{p(\mathbf{x})}{\mathbb{E}_{\mathbf{y} \sim \mathcal{N}(\mathbf{x})}[p(\mathbf{y})]}$$
This ratio asks: "Is this point less likely than we'd expect given its neighborhood?" A ratio significantly below 1 indicates the point is an unexpected outlier relative to its context.
Information-theoretically, low-density regions carry high surprise or self-information: $$I(\mathbf{x}) = -\log p(\mathbf{x})$$
Anomalies are points with high self-information—they are surprising given the learned distribution. This ties density-based anomaly detection to the broader information-theoretic view of outliers as high-surprise observations.
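A minimal sketch of this surprise view, reusing a kernel density estimate as $\hat{p}$ (the bandwidth and toy data are arbitrary): the self-information score is simply the negated log-density.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1.0, size=(300, 2)),
               [[6.0, -6.0]]])                      # one surprising point

kde = KernelDensity(bandwidth=0.75).fit(X)          # Gaussian kernel by default
self_information = -kde.score_samples(X)            # I(x) = -log p_hat(x), in nats

print("Typical surprise (median):", np.median(self_information).round(2))
print("Surprise of the outlier:  ", self_information[-1].round(2))
```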
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

def compute_relative_density(X: np.ndarray, k: int = 10) -> np.ndarray:
    """
    Compute relative density for each point in the dataset.

    Relative density compares a point's local density to the average
    density of its k-nearest neighbors.

    Parameters
    ----------
    X : np.ndarray of shape (n_samples, n_features)
        Input data points
    k : int
        Number of neighbors for density estimation

    Returns
    -------
    np.ndarray : Relative density scores (values < 1 indicate potential anomalies)
    """
    n_samples = X.shape[0]

    # Step 1: Find k-nearest neighbors for each point
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm='auto')
    nn.fit(X)
    distances, indices = nn.kneighbors(X)

    # Exclude self-distances and self-indices
    distances = distances[:, 1:]
    neighbor_indices = indices[:, 1:]

    # Step 2: Compute local density for each point
    # Using average k-distance as density proxy (inverse of distance sum)
    local_densities = k / (np.sum(distances, axis=1) + 1e-10)

    # Step 3: Compute relative density
    # For each point, compare its density to the average density of its neighbors
    relative_densities = np.zeros(n_samples)
    for i in range(n_samples):
        neighbor_densities = local_densities[neighbor_indices[i]]
        avg_neighbor_density = np.mean(neighbor_densities)
        relative_densities[i] = local_densities[i] / (avg_neighbor_density + 1e-10)

    return relative_densities

def visualize_relative_density(X, relative_densities, title="Relative Density"):
    """Visualize points colored by relative density."""
    plt.figure(figsize=(10, 8))

    # Use log scale for better visualization
    log_rd = np.log10(relative_densities + 0.01)
    scatter = plt.scatter(X[:, 0], X[:, 1], c=log_rd, cmap='RdYlBu',
                          s=50, alpha=0.7, edgecolors='k', linewidths=0.5)
    plt.colorbar(scatter, label='log10(Relative Density)')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)

    # Mark anomalies (relative density < 0.5)
    anomaly_mask = relative_densities < 0.5
    if np.any(anomaly_mask):
        plt.scatter(X[anomaly_mask, 0], X[anomaly_mask, 1],
                    s=200, facecolors='none', edgecolors='red', linewidths=2,
                    label=f'Potential Anomalies (n={np.sum(anomaly_mask)})')
        plt.legend()

    plt.tight_layout()
    return plt.gcf()

# Demonstration: multi-density dataset
np.random.seed(123)

# Create a dataset with two clusters of different densities
dense_cluster = np.random.randn(150, 2) * 0.3 + np.array([0, 0])
sparse_cluster = np.random.randn(50, 2) * 1.5 + np.array([5, 5])

# Add anomalies
anomaly1 = np.array([[2, 2]])      # Between clusters
anomaly2 = np.array([[-0.5, 2]])   # Near dense cluster (local outlier)
anomaly3 = np.array([[7, 2]])      # Near sparse cluster (local outlier)

X = np.vstack([dense_cluster, sparse_cluster, anomaly1, anomaly2, anomaly3])

# Compute relative densities
rd = compute_relative_density(X, k=10)

print("Relative Density Statistics:")
print(f"  Dense cluster points:  mean={rd[:150].mean():.3f}, min={rd[:150].min():.3f}")
print(f"  Sparse cluster points: mean={rd[150:200].mean():.3f}, min={rd[150:200].min():.3f}")
print(f"  Anomaly 1 (between clusters): {rd[200]:.3f}")
print(f"  Anomaly 2 (near dense):       {rd[201]:.3f}")
print(f"  Anomaly 3 (near sparse):      {rd[202]:.3f}")
```

Implementing density-based anomaly detection requires several critical design decisions. Each choice involves tradeoffs that affect accuracy, computational cost, and robustness.
The parameter k controls how 'local' the density estimate is. This is perhaps the most important hyperparameter in density-based methods.
Small k (e.g., 3-5): highly local and very sensitive to individual anomalies, but the resulting density estimates are noisy and unstable.
Large k (e.g., 50-100): stable, low-variance estimates, but locality is diluted, so small local anomalies can be smoothed away.
Guidelines for k Selection:
| k Value | Locality | Stability | Sensitivity to Local Anomalies | Computational Cost |
|---|---|---|---|---|
| k = 3-5 | Very High | Low (noisy) | Very High | Lower |
| k = 10-20 | High | Moderate | High | Moderate |
| k = 30-50 | Moderate | Good | Moderate | Moderate |
| k = 100+ | Low | Very Good | Low | Higher |
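One practical way to apply these guidelines is to sweep k and check how stable a point's score is. The sketch below reuses the compute_relative_density function and dataset X defined earlier on this page; the particular k values are arbitrary:

```python
import numpy as np

# Reuses X and compute_relative_density from the demonstration above.
for k in (3, 10, 30, 100):
    rd_k = compute_relative_density(X, k=k)
    print(f"k={k:3d}  anomaly-1 score={rd_k[200]:.3f}  "
          f"dense-cluster score spread={rd_k[:150].std():.3f}")
# Scores that stay clearly below 1 across a range of k are the most
# trustworthy anomaly candidates; scores that flip with k are noise.
```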
The notion of 'neighbor' depends on how we measure distance. The choice of distance metric profoundly affects which points are considered neighbors and hence the density estimate.
Euclidean Distance: Most common for continuous features. Assumes isotropic feature importance. $$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
Manhattan Distance (L1): More robust to outlier features. Better for sparse, high-dimensional data. $$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} |x_i - y_i|$$
Mahalanobis Distance: Accounts for feature correlations and scales. Requires covariance estimation. $$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{y})}$$
Cosine Distance: For directional data. Common in text and high-dimensional sparse data. $$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
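The metric plugs directly into the neighbor search. The minimal sketch below shows how the choice changes which points count as neighbors; the Mahalanobis variant assumes the sample covariance matrix is invertible:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
# Correlated 2-D data: Euclidean and Mahalanobis neighborhoods will differ
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=cov, size=200)

nn_euclid = NearestNeighbors(n_neighbors=6, metric='euclidean').fit(X)
nn_manhattan = NearestNeighbors(n_neighbors=6, metric='manhattan').fit(X)
VI = np.linalg.inv(np.cov(X, rowvar=False))          # inverse covariance for Mahalanobis
nn_mahal = NearestNeighbors(n_neighbors=6, metric='mahalanobis',
                            metric_params={'VI': VI}).fit(X)

query = X[:1]
for name, nn in [('euclidean', nn_euclid), ('manhattan', nn_manhattan), ('mahalanobis', nn_mahal)]:
    _, idx = nn.kneighbors(query)
    print(f"{name:11s} neighbors of point 0: {idx[0][1:]}")
```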
Distance-based methods are sensitive to feature scales. Always standardize features (zero mean, unit variance) unless you have domain knowledge suggesting otherwise. For mixed data types, consider feature-specific distance functions combined via weighted sums.
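A minimal sketch of that standardization step with scikit-learn, using deliberately mismatched feature scales to show how scaling changes the neighbor sets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
# Feature 0 on a unit scale, feature 1 roughly 1000x larger
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])

X_scaled = StandardScaler().fit_transform(X)

raw_idx = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X[:1])[1][0][1:]
scaled_idx = NearestNeighbors(n_neighbors=6).fit(X_scaled).kneighbors(X_scaled[:1])[1][0][1:]
print("Neighbors of point 0 (raw):   ", raw_idx)     # dominated by the large-scale feature
print("Neighbors of point 0 (scaled):", scaled_idx)  # usually a different set
```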
Given k neighbor distances, how should we aggregate them into a single density score?
Direct k-Distance: Use only the k-th neighbor distance. Simple but noisy.
Mean Distance: Average of all k neighbor distances. Balanced approach.
Harmonic Mean: Emphasizes closer neighbors, useful when immediate neighborhood matters more.
Reachability-Based (LOF style): Uses max of k-distance and actual distance to smooth local variations.
$$\text{reach-dist}_k(\mathbf{x}, \mathbf{y}) = \max(d_k(\mathbf{y}), d(\mathbf{x}, \mathbf{y}))$$
This clever construction ensures that for dense cores, the reachability distance is dominated by the core's k-distance, providing stability.
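A minimal NumPy sketch of the reachability distance, computing reach-dist_k for all pairs from precomputed k-distances (the toy data is arbitrary):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reachability_distances(X: np.ndarray, k: int = 5) -> np.ndarray:
    """reach-dist_k(x_i, x_j) = max(k-distance(x_j), d(x_i, x_j)) for all pairs."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    k_dist = dists[:, -1]                                  # k-distance of each point
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Broadcasting: column j is clipped from below by the k-distance of point j
    return np.maximum(pairwise, k_dist[None, :])

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 2))
R = reachability_distances(X, k=5)
print("Plain distance d(0, 1):       ", np.linalg.norm(X[0] - X[1]).round(3))
print("Reachability reach-dist(0, 1):", R[0, 1].round(3))
```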
Density-based methods typically output continuous scores. Converting these to binary anomaly labels requires a threshold:
Fixed Percentile: Label top p% lowest density points as anomalies. Requires prior knowledge of contamination rate.
Statistical: Use mean + n*std as threshold. Assumes score distribution is approximately normal.
Elbow Method: Plot sorted scores, find elbow point where density drops sharply.
Contamination Factor: Some algorithms (like LOF in sklearn) accept expected contamination rate as input.
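Two of these thresholding strategies in a minimal sketch, applied to generic anomaly scores where higher means more anomalous (the 5% rate and the n = 3 multiplier are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
scores = np.concatenate([rng.normal(1.0, 0.2, 500),   # bulk of normal scores
                         rng.normal(3.0, 0.3, 10)])   # a few high (anomalous) scores

# Fixed percentile: flag the top 5% highest scores
pct_threshold = np.quantile(scores, 0.95)
pct_flags = scores > pct_threshold

# Statistical: mean + 3 * std (assumes a roughly normal score distribution)
stat_threshold = scores.mean() + 3 * scores.std()
stat_flags = scores > stat_threshold

print(f"Percentile threshold: {pct_threshold:.2f} -> {pct_flags.sum()} flagged")
print(f"Mean+3*std threshold: {stat_threshold:.2f} -> {stat_flags.sum()} flagged")
```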
Robust anomaly detection systems must handle edge cases gracefully. Understanding pathological scenarios helps design more resilient systems.
When duplicate points exist (identical feature values), distance-based density may become ill-defined: the distance to the nearest neighbors can be exactly zero, sending the density estimate to infinity.
Solutions: remove exact duplicates before fitting, add a small epsilon to distances (as in the code above), or compute the k-distance over distinct values only.
In some datasets, neighborhoods may contain extremely heterogeneous densities: a point at the edge of a dense cluster can draw neighbors from both the dense core and a nearby sparse region, making its reference density unstable.
Solutions: use reachability-based smoothing (as in LOF), or increase k so the reference density averages over the heterogeneity.
In some geometric configurations, relative density may fail. Consider a perfect grid of points—all have identical absolute density and identical relative density (= 1), yet edge points might be contextually different. Similarly, in datasets with highly regular structure, density-based methods may not reveal the geometric anomalies you seek.
Points on the boundary of the data manifold have neighbors only on one side, potentially leading to underestimated densities and inflated outlier scores for perfectly legitimate boundary points.
Solutions: rely on relative rather than absolute density (boundary neighbors share the effect, which partially compensates), use a larger k, or apply more conservative thresholds so boundary points are not over-flagged.
When large regions have uniform density, relative density approaches 1 for all points, providing no discrimination between them.
Solutions: combine the relative score with an absolute-density or distance-based score, or fall back to a global method within such regions.
When data arrives as a stream or the underlying distribution changes, a density model fitted on historical data goes stale: yesterday's dense region may be sparse today, and vice versa.
Solutions: refit the model periodically, estimate densities over a sliding window of recent points, or use incremental variants of density-based detectors.
We have established the theoretical foundation for density-based anomaly detection, with relative density as the central concept.
Density is not inherently meaningful for anomaly detection. Absolute density varies naturally across datasets. A point in a sparse region isn't necessarily anomalous—it might simply belong to a low-density cluster.
Relative density provides local context. By comparing a point's density to its neighbors' densities, we can distinguish between points that merely live in naturally sparse regions and points that are genuinely less dense than their local context.
The choice of k controls the scale of locality. Small k provides fine-grained, high-variance estimates; large k provides coarse, low-variance estimates. Practitioners must balance sensitivity against stability.
Distance metrics define the neighborhood structure. The choice of metric (Euclidean, Manhattan, Mahalanobis, etc.) determines which points are considered neighbors, fundamentally shaping the density landscape.
Relative density is the ratio of a point's local density to the average density of its neighborhood. Points with significantly lower relative density than expected are candidate anomalies. This principle underlies LOF, LOCI, and many modern density-based anomaly detection algorithms.
With relative density as our conceptual foundation, we now have the tools to understand specific density-based anomaly detection algorithms:
Next Page (DBSCAN for Outliers): We'll explore how the clustering algorithm DBSCAN naturally identifies noise points as anomalies, leveraging density concepts in a different way than LOF.
Subsequent Topics: Local density estimation techniques, multivariate density considerations, and scalability strategies for production deployments.
The density paradigm is remarkably versatile. From simple k-NN density to sophisticated local reachability densities, the core insight remains: anomalies are points whose density is unusually low relative to their local context.