Loading learning content...
DBSCAN is elegant and powerful, but it carries a fundamental limitation: the single ε constraint. Every cluster must exhibit similar density—dense enough for core points to connect within ε-distance. When real-world data contains clusters of varying densities, a single ε cannot capture them all:
This is not merely a parameter-tuning inconvenience—it reflects a fundamental mismatch between the algorithm's assumptions and common data characteristics. Many natural datasets exhibit multi-scale clustering structure with clusters spanning orders of magnitude in density.
OPTICS (Ordering Points To Identify the Clustering Structure), introduced by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander in 1999, provides an elegant solution. Rather than producing clusters directly, OPTICS produces an ordering of points that encodes clustering structure at all density levels, from which clusters can be extracted at any desired threshold.
By the end of this page, you will understand OPTICS' core concepts (core-distance and reachability-distance), master the algorithm's ordering procedure, learn to interpret reachability plots, and understand how to extract clusters at multiple density levels. You'll gain the theoretical depth needed to apply OPTICS effectively and understand its relationship to DBSCAN.
The Multi-Density Problem:
Consider a dataset with three clusters:
With DBSCAN:
No single ε captures the true three-cluster structure.
OPTICS' key insight: instead of committing to one ε, compute a 'reachability profile' that captures structure at ALL density scales. Think of it as running DBSCAN at every possible ε simultaneously and encoding the results in a single ordering.
From Clustering to Ordering:
OPTICS fundamentally changes the output type:
| DBSCAN | OPTICS |
|---|---|
| Input: ε, MinPts | Input: ε_max, MinPts |
| Output: Cluster labels | Output: Ordered list + reachability values |
| Fixed density threshold | Variable density thresholds |
| One clustering | Multiple clusterings extractable |
The OPTICS ordering has a remarkable property: points in the same cluster appear consecutively in the ordering, with cluster boundaries marked by 'valleys' in the reachability values. This ordering can be visualized as a reachability plot, which reveals the hierarchical cluster structure at a glance.
OPTICS introduces two fundamental concepts that generalize DBSCAN's fixed-ε neighborhoods:
Core Distance:
The core distance of a point p with respect to MinPts, denoted $\text{core-dist}_{\text{MinPts}}(p)$, is the smallest ε value at which p would become a core point:
$$\text{core-dist}{\text{MinPts}}(p) = \begin{cases} \text{UNDEFINED} & \text{if } |N{\varepsilon_{\max}}(p)| < \text{MinPts} \ d(p, \text{MinPts-th nearest neighbor}) & \text{otherwise} \end{cases}$$
In other words, it's the distance to p's MinPts-th nearest neighbor. If p doesn't have MinPts neighbors within ε_max, its core distance is undefined.
Interpretation: Core distance tells us the 'local density' at point p. Small core distance means p is in a dense region (neighbors are close). Large core distance means p is in a sparse region.
If MinPts = 5 and point p's 5 nearest neighbors are at distances [0.1, 0.2, 0.3, 0.35, 0.5], then core-dist(p) = 0.5 (the 5th nearest distance). At any ε ≥ 0.5, p would be a core point.
Reachability Distance:
The reachability distance of point q with respect to point p is:
$$\text{reach-dist}{\text{MinPts}}(q, p) = \max(\text{core-dist}{\text{MinPts}}(p), d(p, q))$$
This is a key asymmetric measure. It's defined only when p has a defined core distance (p could be a core point at some ε ≤ ε_max).
Interpretation: The reachability distance answers: "What's the minimum ε at which q would be directly density-reachable from p?"
| Scenario | core-dist(p) | d(p,q) | reach-dist(q,p) | Explanation |
|---|---|---|---|---|
| Dense p, nearby q | 0.3 | 0.2 | 0.3 | q is closer than core threshold, limited by core-dist |
| Dense p, far q | 0.3 | 0.8 | 0.8 | q is farther than core threshold, limited by d(p,q) |
| Sparse p, nearby q | 1.5 | 0.2 | 1.5 | p needs large ε to be core, dominates reach-dist |
| Sparse p, far q | 1.5 | 2.0 | 2.0 | Both p's sparsity and q's distance are large |
Why Reachability Distance Matters:
The reachability distance smooths out the 'density transition' between regions:
This pattern creates the characteristic valleys and peaks in reachability plots that reveal cluster structure.
OPTICS produces an ordering of points such that spatially close points (at any density scale) appear near each other in the ordering. The algorithm maintains and outputs two arrays:
High-Level Algorithm:
OPTICS(D, ε_max, MinPts):
Compute core-distance for all points in D
ordered_list ← []
reachability ← {} // maps point → reachability distance
processed ← {} // tracks which points are done
for each unprocessed point p in D:
mark p as processed
append p to ordered_list
reachability[p] ← INFINITY (or UNDEFINED)
if core-dist(p) is defined:
seeds ← empty priority queue (min-heap by reach-dist)
update(seeds, p, processed) // add p's neighbors
while seeds is not empty:
q ← extract-min(seeds)
mark q as processed
append q to ordered_list
reachability[q] ← q's priority value
if core-dist(q) is defined:
update(seeds, q, processed)
return ordered_list, reachability
The Update Procedure:
update(seeds, p, processed):
for each q in N_{ε_max}(p):
if q is not processed:
new_reach_dist ← reach-dist(q, p)
if q is not in seeds:
insert q into seeds with priority new_reach_dist
else if new_reach_dist < current priority of q:
decrease priority of q to new_reach_dist
Each neighbor q of p is either added to the seed set or has its reachability distance improved if p offers a better (smaller) path. This greedy expansion explores dense regions first, naturally grouping similar-density points.
Algorithm Properties:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243
import numpy as npimport heapqfrom typing import List, Tuple, Optionalfrom dataclasses import dataclass, field @dataclassclass OPTICSResult: """Result of OPTICS algorithm.""" ordering: np.ndarray # Indices in cluster order reachability: np.ndarray # Reachability distance for each point core_distances: np.ndarray # Core distance for each point predecessor: np.ndarray # Which point reached each point class OPTICS: """ OPTICS: Ordering Points To Identify the Clustering Structure Produces an ordering of points that reveals clustering structure at multiple density levels. Parameters ---------- max_eps : float Maximum neighborhood radius. Points farther than this are never considered neighbors. min_pts : int Minimum points for core point definition. Attributes ---------- ordering_ : ndarray The OPTICS ordering of points. reachability_ : ndarray Reachability distance for each point in ordering. """ UNDEFINED = float('inf') def __init__(self, max_eps: float = float('inf'), min_pts: int = 5): self.max_eps = max_eps self.min_pts = min_pts self.ordering_ = None self.reachability_ = None self.core_distances_ = None def fit(self, X: np.ndarray) -> 'OPTICS': """ Compute OPTICS ordering and reachability. Parameters ---------- X : ndarray of shape (n_samples, n_features) Input data. Returns ------- self """ n_samples = X.shape[0] # Compute all pairwise distances (for simplicity) # In production, use spatial indexing distances = self._compute_distances(X) # Compute core distances core_distances = self._compute_core_distances(distances) # OPTICS main loop processed = np.zeros(n_samples, dtype=bool) ordering = [] reachability = np.full(n_samples, self.UNDEFINED) predecessor = np.full(n_samples, -1, dtype=int) for i in range(n_samples): if processed[i]: continue # Start processing from point i self._process_point( i, X, distances, core_distances, processed, ordering, reachability, predecessor ) # Convert ordering to array form ordering = np.array(ordering) # Reachability in ordering sequence reachability_ordered = reachability[ordering] self.ordering_ = ordering self.reachability_ = reachability_ordered self.core_distances_ = core_distances return self def _compute_distances(self, X: np.ndarray) -> np.ndarray: """Compute pairwise Euclidean distances.""" sq_norms = np.sum(X ** 2, axis=1) distances = np.sqrt( sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * X @ X.T ) return np.maximum(distances, 0) # Handle numerical issues def _compute_core_distances(self, distances: np.ndarray) -> np.ndarray: """Compute core distance for each point.""" n_samples = distances.shape[0] core_distances = np.full(n_samples, self.UNDEFINED) for i in range(n_samples): # Get distances to all neighbors within max_eps dists = distances[i] within_eps = dists[dists <= self.max_eps] if len(within_eps) >= self.min_pts: # Core distance is distance to min_pts-th nearest neighbor sorted_dists = np.sort(within_eps) core_distances[i] = sorted_dists[self.min_pts - 1] return core_distances def _process_point( self, point_idx: int, X: np.ndarray, distances: np.ndarray, core_distances: np.ndarray, processed: np.ndarray, ordering: List[int], reachability: np.ndarray, predecessor: np.ndarray ) -> None: """Process a point and expand its neighborhood.""" # Priority queue: (reachability_dist, point_index) # Using list as heap with heapq seeds = [] processed[point_idx] = True ordering.append(point_idx) if core_distances[point_idx] < self.UNDEFINED: # This point can be a core point - expand self._update_seeds( point_idx, distances, core_distances, processed, seeds, reachability, predecessor ) while seeds: _, next_idx = heapq.heappop(seeds) if processed[next_idx]: continue processed[next_idx] = True ordering.append(next_idx) if core_distances[next_idx] < self.UNDEFINED: self._update_seeds( next_idx, distances, core_distances, processed, seeds, reachability, predecessor ) def _update_seeds( self, center_idx: int, distances: np.ndarray, core_distances: np.ndarray, processed: np.ndarray, seeds: List[Tuple[float, int]], reachability: np.ndarray, predecessor: np.ndarray ) -> None: """Update seed set with neighbors of center point.""" core_dist = core_distances[center_idx] neighbors = np.where(distances[center_idx] <= self.max_eps)[0] for neighbor_idx in neighbors: if processed[neighbor_idx]: continue # Compute reachability distance dist = distances[center_idx, neighbor_idx] new_reach_dist = max(core_dist, dist) if new_reach_dist < reachability[neighbor_idx]: reachability[neighbor_idx] = new_reach_dist predecessor[neighbor_idx] = center_idx heapq.heappush(seeds, (new_reach_dist, neighbor_idx)) def extract_dbscan_clusters( optics_result: OPTICS, eps: float) -> np.ndarray: """ Extract DBSCAN-like clusters at a specific epsilon threshold. Points with reachability > eps start new clusters. """ ordering = optics_result.ordering_ reachability = optics_result.reachability_ core_distances = optics_result.core_distances_ n_samples = len(ordering) labels = np.full(n_samples, -1, dtype=int) # -1 for noise cluster_id = -1 for i, point_idx in enumerate(ordering): if reachability[i] > eps: # This point is not density-reachable at this eps if core_distances[point_idx] <= eps: # But it's a core point - start new cluster cluster_id += 1 labels[point_idx] = cluster_id # else: it's noise else: # Density-reachable - add to current cluster labels[point_idx] = cluster_id return labels # Example usageif __name__ == "__main__": np.random.seed(42) # Create clusters with different densities cluster1 = np.random.normal([0, 0], 0.3, (100, 2)) # Dense cluster2 = np.random.normal([4, 4], 1.0, (80, 2)) # Sparse cluster3 = np.random.normal([8, 0], 0.5, (60, 2)) # Medium noise = np.random.uniform(-2, 10, (20, 2)) X = np.vstack([cluster1, cluster2, cluster3, noise]) # Run OPTICS optics = OPTICS(max_eps=3.0, min_pts=5) optics.fit(X) print("OPTICS ordering computed") print(f"Reachability range: {optics.reachability_.min():.3f} to {optics.reachability_.max():.3f}")The reachability plot is OPTICS' signature visualization. It's a bar plot where:
Interpreting the Reachability Plot:
The plot reveals cluster structure through its characteristic shape:
Valleys = Clusters: Sequences of low reachability values indicate dense, connected regions—clusters. Points within a valley are mutually reachable at low ε thresholds.
Peaks = Cluster Boundaries: Spikes in reachability indicate transitions between clusters. The height of the peak corresponds to the 'gap' between clusters—higher peaks mean more separation.
Plateau = Noise: Flat regions at high reachability values often represent noise or sparse, unstructured regions.
Imagine drawing a horizontal line at some height ε on the reachability plot. Every connected segment below the line (where the bars are shorter than ε) corresponds to a cluster. Moving the line up or down changes which clusters you see—lower lines find only the densest clusters, higher lines merge adjacent clusters.
Hierarchical Structure:
The reachability plot encodes a hierarchical clustering:
Example Patterns:
| Pattern | Meaning |
|---|---|
| Sharp V-shape | Well-separated cluster with clear boundaries |
| Broad U-shape | Loose cluster with fuzzy boundaries |
| Multiple nested Vs | Hierarchical sub-cluster structure |
| Saw-tooth pattern | Many small, similar-density clusters |
| Flat line at maximum | Noise or no structure at this scale |
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163
import numpy as npimport matplotlib.pyplot as pltfrom matplotlib.patches import Rectanglefrom typing import List, Tuple, Optional def plot_reachability( reachability: np.ndarray, ordering: np.ndarray, labels: Optional[np.ndarray] = None, threshold: Optional[float] = None, title: str = "OPTICS Reachability Plot") -> plt.Figure: """ Create an OPTICS reachability plot. Parameters ---------- reachability : ndarray Reachability distances in ordering. ordering : ndarray OPTICS ordering of point indices. labels : ndarray, optional Cluster labels for coloring. threshold : float, optional Draw a horizontal line at this threshold. Returns ------- fig : matplotlib.Figure """ fig, ax = plt.subplots(figsize=(14, 6)) n = len(reachability) # Replace infinities for plotting reach_plot = reachability.copy() max_finite = np.max(reach_plot[np.isfinite(reach_plot)]) reach_plot[~np.isfinite(reach_plot)] = max_finite * 1.1 if labels is not None: # Color by cluster unique_labels = np.unique(labels) colors = plt.cm.tab20(np.linspace(0, 1, len(unique_labels))) color_map = {label: colors[i] for i, label in enumerate(unique_labels)} bar_colors = [color_map[labels[ordering[i]]] for i in range(n)] ax.bar(range(n), reach_plot, width=1.0, color=bar_colors, edgecolor='none') else: ax.bar(range(n), reach_plot, width=1.0, color='steelblue', edgecolor='none') if threshold is not None: ax.axhline(y=threshold, color='red', linestyle='--', linewidth=2, label=f'ε = {threshold}') ax.legend() ax.set_xlabel('Points (OPTICS ordering)', fontsize=12) ax.set_ylabel('Reachability Distance', fontsize=12) ax.set_title(title, fontsize=14) ax.set_xlim(0, n) ax.set_ylim(0, max_finite * 1.15) plt.tight_layout() return fig def annotate_clusters( ax: plt.Axes, reachability: np.ndarray, cluster_spans: List[Tuple[int, int, str]]) -> None: """ Add cluster annotations to a reachability plot. Parameters ---------- ax : matplotlib Axes The plot axes. cluster_spans : list of (start, end, label) Cluster regions to annotate. """ max_reach = np.max(reachability[np.isfinite(reachability)]) for start, end, label in cluster_spans: # Draw bracket mid = (start + end) / 2 ax.annotate( label, xy=(mid, 0), xytext=(mid, -max_reach * 0.15), ha='center', fontsize=10, fontweight='bold', arrowprops=dict(arrowstyle='-', color='gray') ) # Draw span line ax.hlines( y=-max_reach * 0.08, xmin=start, xmax=end, color='gray', linewidth=2 ) def interactive_threshold_analysis( reachability: np.ndarray, ordering: np.ndarray, core_distances: np.ndarray, thresholds: List[float]) -> None: """ Show cluster extraction at multiple thresholds. """ n_thresholds = len(thresholds) fig, axes = plt.subplots(n_thresholds, 1, figsize=(14, 4 * n_thresholds)) if n_thresholds == 1: axes = [axes] for ax, eps in zip(axes, thresholds): # Extract clusters at this threshold labels = np.full(len(ordering), -1, dtype=int) labels_ordered = np.full(len(ordering), -1, dtype=int) cluster_id = -1 for i, point_idx in enumerate(ordering): if reachability[i] > eps: if core_distances[point_idx] <= eps: cluster_id += 1 labels_ordered[i] = cluster_id labels[point_idx] = cluster_id else: labels_ordered[i] = cluster_id labels[point_idx] = cluster_id # Plot reach_plot = reachability.copy() max_finite = np.max(reach_plot[np.isfinite(reach_plot)]) reach_plot[~np.isfinite(reach_plot)] = max_finite * 1.1 # Color by cluster n_clusters = cluster_id + 1 colors = plt.cm.tab20(np.linspace(0, 1, max(n_clusters + 1, 2))) bar_colors = [] for i in range(len(ordering)): if labels_ordered[i] == -1: bar_colors.append('gray') else: bar_colors.append(colors[labels_ordered[i] % 20]) ax.bar(range(len(ordering)), reach_plot, width=1.0, color=bar_colors, edgecolor='none') ax.axhline(y=eps, color='red', linestyle='--', linewidth=2) n_noise = np.sum(labels == -1) ax.set_title(f'ε = {eps:.2f}: {n_clusters} clusters, {n_noise} noise points') ax.set_ylabel('Reachability') ax.set_xlim(0, len(ordering)) axes[-1].set_xlabel('OPTICS Ordering') plt.tight_layout() plt.show()The OPTICS ordering is not a clustering per se—it's a data structure from which clusterings can be extracted. Several methods exist for extracting meaningful clusters:
Method 1: DBSCAN-Style Extraction (Threshold Method)
The simplest approach: pick a reachability threshold ε and treat all points with reachability ≤ ε as clustered:
This is equivalent to running DBSCAN with that ε value.
Advantages: Simple, interpretable, DBSCAN-compatible Disadvantages: Requires choosing a single threshold, doesn't capture hierarchical structure
Method 2: Xi Cluster Extraction
The Xi method automatically detects significant clusters by finding steep downward and upward gradients in the reachability plot:
Clusters are the regions between matching SDA-SUA pairs. The steepness threshold ξ (xi) controls what counts as 'significant':
$$\text{steep_down}(i) \Leftrightarrow \text{reach}(i+1) \leq \text{reach}(i) \cdot (1 - \xi)$$
$$\text{steep_up}(i) \Leftrightarrow \text{reach}(i+1) \geq \text{reach}(i) \cdot (1 + \xi) / (1 - \xi)$$
Advantages: Automatic, can extract hierarchical clusters Disadvantages: More complex, ξ parameter still needs tuning
Common Xi values range from 0.01 to 0.1. Lower values (0.01-0.03) find more, smaller clusters. Higher values (0.05-0.1) find fewer, larger clusters. Xi = 0.05 is a reasonable starting point for many datasets.
While OPTICS generalizes DBSCAN, they serve different purposes and have different trade-offs:
| Criterion | DBSCAN | OPTICS |
|---|---|---|
| Primary output | Cluster labels | Ordering + reachability values |
| Density handling | Single density scale | All density scales |
| Parameters | ε, MinPts | ε_max, MinPts (+ extraction params) |
| Time complexity | O(n log n) | O(n log n) with same indexing |
| Space complexity | O(n) | O(n) for ordering + values |
| Cluster hierarchy | Not available | Fully captured |
| Interpretability | Direct cluster assignments | Requires extraction step |
| Parameter sensitivity | High (must choose ε) | Lower (ε_max is an upper bound) |
A recommended workflow: (1) Run OPTICS to understand the density structure, (2) Examine reachability plot to identify natural density thresholds, (3) Either extract from OPTICS at chosen threshold or run DBSCAN with informed ε choice. OPTICS serves as an intelligent preprocessing step for DBSCAN parameter selection.
Time Complexity Analysis:
OPTICS has similar complexity to DBSCAN:
Naive Implementation:
With Spatial Indexing (KD-tree/Ball-tree):
Space Complexity:
Choosing ε_max involves a trade-off: too small might miss sparse clusters, too large increases computation. A good strategy: set ε_max based on the k-distance plot (like DBSCAN's ε selection), where k = MinPts. The 'elbow' suggests a reasonable upper bound.
OPTICS represents an elegant generalization of DBSCAN that addresses the fundamental limitation of fixed-density clustering. Let's consolidate the key concepts:
What's Next:
While OPTICS advances beyond DBSCAN's single-scale limitation, it still requires parameter selection (MinPts, ε_max, extraction parameters). HDBSCAN (Hierarchical DBSCAN) takes the final step: automatically extracting an optimal flat clustering from the hierarchical structure, with minimal user intervention.
You now understand OPTICS' theoretical foundations, algorithmic mechanics, and practical applications. You can interpret reachability plots, extract clusters at various thresholds, and understand when OPTICS is preferable to DBSCAN. Next, we'll explore HDBSCAN—the state-of-the-art evolution of density-based clustering.