Machine LearningDensity-Based Clustering

Density-Based Clustering

LevelIntermediate

Duration120 mins

TopicDensity-Based Clustering

3 / 5

OPTICS Algorithm

Beyond Fixed Epsilon: The Multi-Scale Challenge

DBSCAN is elegant and powerful, but it carries a fundamental limitation: the single ε constraint. Every cluster must exhibit similar density—dense enough for core points to connect within ε-distance. When real-world data contains clusters of varying densities, a single ε cannot capture them all:

ε too small: Dense clusters are found, but sparser clusters shatter into noise
ε too large: Sparse clusters are found, but dense clusters merge inappropriately

This is not merely a parameter-tuning inconvenience—it reflects a fundamental mismatch between the algorithm's assumptions and common data characteristics. Many natural datasets exhibit multi-scale clustering structure with clusters spanning orders of magnitude in density.

OPTICS (Ordering Points To Identify the Clustering Structure), introduced by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander in 1999, provides an elegant solution. Rather than producing clusters directly, OPTICS produces an ordering of points that encodes clustering structure at all density levels, from which clusters can be extracted at any desired threshold.

What You Will Learn

By the end of this page, you will understand OPTICS' core concepts (core-distance and reachability-distance), master the algorithm's ordering procedure, learn to interpret reachability plots, and understand how to extract clusters at multiple density levels. You'll gain the theoretical depth needed to apply OPTICS effectively and understand its relationship to DBSCAN.

Motivation and Key Insight

The Multi-Density Problem:

Consider a dataset with three clusters:

Cluster A: 1000 points in a 1-unit radius ball (very dense)
Cluster B: 200 points in a 5-unit radius ball (moderate density)
Cluster C: 50 points in a 10-unit radius ball (sparse)

With DBSCAN:

ε = 0.1: Only Cluster A is found; B and C are noise
ε = 0.5: A and B are found; C is noise; A might fragment
ε = 1.0: All three might merge into one cluster

No single ε captures the true three-cluster structure.

The OPTICS Insight

OPTICS' key insight: instead of committing to one ε, compute a 'reachability profile' that captures structure at ALL density scales. Think of it as running DBSCAN at every possible ε simultaneously and encoding the results in a single ordering.

From Clustering to Ordering:

OPTICS fundamentally changes the output type:

DBSCAN	OPTICS
Input: ε, MinPts	Input: ε_max, MinPts
Output: Cluster labels	Output: Ordered list + reachability values
Fixed density threshold	Variable density thresholds
One clustering	Multiple clusterings extractable

The OPTICS ordering has a remarkable property: points in the same cluster appear consecutively in the ordering, with cluster boundaries marked by 'valleys' in the reachability values. This ordering can be visualized as a reachability plot, which reveals the hierarchical cluster structure at a glance.

Core Distance and Reachability Distance

OPTICS introduces two fundamental concepts that generalize DBSCAN's fixed-ε neighborhoods:

Core Distance:

The core distance of a point p with respect to MinPts, denoted $\text{core-dist}_{\text{MinPts}}(p)$, is the smallest ε value at which p would become a core point:

$$\text{core-dist}{\text{MinPts}}(p) = \begin{cases} \text{UNDEFINED} & \text{if } |N{\varepsilon_{\max}}(p)| < \text{MinPts} \ d(p, \text{MinPts-th nearest neighbor}) & \text{otherwise} \end{cases}$$

In other words, it's the distance to p's MinPts-th nearest neighbor. If p doesn't have MinPts neighbors within ε_max, its core distance is undefined.

Interpretation: Core distance tells us the 'local density' at point p. Small core distance means p is in a dense region (neighbors are close). Large core distance means p is in a sparse region.

Core Distance Example

If MinPts = 5 and point p's 5 nearest neighbors are at distances [0.1, 0.2, 0.3, 0.35, 0.5], then core-dist(p) = 0.5 (the 5th nearest distance). At any ε ≥ 0.5, p would be a core point.

Reachability Distance:

The reachability distance of point q with respect to point p is:

$$\text{reach-dist}{\text{MinPts}}(q, p) = \max(\text{core-dist}{\text{MinPts}}(p), d(p, q))$$

This is a key asymmetric measure. It's defined only when p has a defined core distance (p could be a core point at some ε ≤ ε_max).

Interpretation: The reachability distance answers: "What's the minimum ε at which q would be directly density-reachable from p?"

If d(p, q) > core-dist(p): The reachability distance is simply d(p, q)—we need ε that large to include q in p's neighborhood.
If d(p, q) ≤ core-dist(p): The reachability distance is core-dist(p)—even though q is close, p needs ε = core-dist(p) to become core.

Core and Reachability Distance Examples
Scenario	core-dist(p)	d(p,q)	reach-dist(q,p)	Explanation
Dense p, nearby q	0.3	0.2	0.3	q is closer than core threshold, limited by core-dist
Dense p, far q	0.3	0.8	0.8	q is farther than core threshold, limited by d(p,q)
Sparse p, nearby q	1.5	0.2	1.5	p needs large ε to be core, dominates reach-dist
Sparse p, far q	1.5	2.0	2.0	Both p's sparsity and q's distance are large

Why Reachability Distance Matters:

The reachability distance smooths out the 'density transition' between regions:

Within dense clusters: Reachability distances are small (limited by small core distances)
At cluster boundaries: Reachability distances spike (transitioning to sparse regions)
Between clusters: Reachability distances are large (sparse void between clusters)

This pattern creates the characteristic valleys and peaks in reachability plots that reveal cluster structure.

The OPTICS Algorithm

OPTICS produces an ordering of points such that spatially close points (at any density scale) appear near each other in the ordering. The algorithm maintains and outputs two arrays:

Ordered list: The sequence in which points are processed
Reachability values: The reachability distance at which each point was reached

High-Level Algorithm:

OPTICS(D, ε_max, MinPts):
    Compute core-distance for all points in D
    
    ordered_list ← []
    reachability ← {} // maps point → reachability distance
    processed ← {} // tracks which points are done
    
    for each unprocessed point p in D:
        mark p as processed
        append p to ordered_list
        reachability[p] ← INFINITY (or UNDEFINED)
        
        if core-dist(p) is defined:
            seeds ← empty priority queue (min-heap by reach-dist)
            update(seeds, p, processed)  // add p's neighbors
            
            while seeds is not empty:
                q ← extract-min(seeds)
                mark q as processed
                append q to ordered_list
                reachability[q] ← q's priority value
                
                if core-dist(q) is defined:
                    update(seeds, q, processed)
    
    return ordered_list, reachability

The Update Procedure:

update(seeds, p, processed):
    for each q in N_{ε_max}(p):
        if q is not processed:
            new_reach_dist ← reach-dist(q, p)
            
            if q is not in seeds:
                insert q into seeds with priority new_reach_dist
            else if new_reach_dist < current priority of q:
                decrease priority of q to new_reach_dist

Each neighbor q of p is either added to the seed set or has its reachability distance improved if p offers a better (smaller) path. This greedy expansion explores dense regions first, naturally grouping similar-density points.

Algorithm Properties:

Key Properties

•Density-First Expansion: The priority queue ensures we always process the point with smallest reachability distance next—staying within dense regions as long as possible.
•Lazy Cluster Discovery: Points from one cluster are fully processed before 'jumping' to another cluster, creating natural cluster groupings in the ordering.
•Captures All Scales: The reachability values encode clustering structure from the finest (densest) to coarsest (sparsest) levels.
•DBSCAN as Special Case: Extracting clusters at a fixed reachability threshold is equivalent to running DBSCAN at that ε.

optics.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
import numpy as np
import heapq
from typing import List, Tuple, Optional
from dataclasses import dataclass, field
 
@dataclass
class OPTICSResult:
    """Result of OPTICS algorithm."""
    ordering: np.ndarray           # Indices in cluster order
    reachability: np.ndarray       # Reachability distance for each point
    core_distances: np.ndarray     # Core distance for each point
    predecessor: np.ndarray        # Which point reached each point
 
 
class OPTICS:
    """
    OPTICS: Ordering Points To Identify the Clustering Structure
    
    Produces an ordering of points that reveals clustering structure
    at multiple density levels.
    
    Parameters
    ----------
    max_eps : float
        Maximum neighborhood radius. Points farther than this are never
        considered neighbors.
    min_pts : int
        Minimum points for core point definition.
    
    Attributes
    ----------
    ordering_ : ndarray
        The OPTICS ordering of points.
    reachability_ : ndarray
        Reachability distance for each point in ordering.
    """
    
    UNDEFINED = float('inf')
    
    def __init__(self, max_eps: float = float('inf'), min_pts: int = 5):
        self.max_eps = max_eps
        self.min_pts = min_pts
        self.ordering_ = None
        self.reachability_ = None
        self.core_distances_ = None
    
    def fit(self, X: np.ndarray) -> 'OPTICS':
        """
        Compute OPTICS ordering and reachability.
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Input data.
        
        Returns
        -------
        self
        """
        n_samples = X.shape[0]
        
        # Compute all pairwise distances (for simplicity)
        # In production, use spatial indexing
        distances = self._compute_distances(X)
        
        # Compute core distances
        core_distances = self._compute_core_distances(distances)
        
        # OPTICS main loop
        processed = np.zeros(n_samples, dtype=bool)
        ordering = []
        reachability = np.full(n_samples, self.UNDEFINED)
        predecessor = np.full(n_samples, -1, dtype=int)
        
        for i in range(n_samples):
            if processed[i]:
                continue
            
            # Start processing from point i
            self._process_point(
                i, X, distances, core_distances,
                processed, ordering, reachability, predecessor
            )
        
        # Convert ordering to array form
        ordering = np.array(ordering)
        
        # Reachability in ordering sequence
        reachability_ordered = reachability[ordering]
        
        self.ordering_ = ordering
        self.reachability_ = reachability_ordered
        self.core_distances_ = core_distances
        
        return self
    
    def _compute_distances(self, X: np.ndarray) -> np.ndarray:
        """Compute pairwise Euclidean distances."""
        sq_norms = np.sum(X ** 2, axis=1)
        distances = np.sqrt(
            sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 
            2 * X @ X.T
        )
        return np.maximum(distances, 0)  # Handle numerical issues
    
    def _compute_core_distances(self, distances: np.ndarray) -> np.ndarray:
        """Compute core distance for each point."""
        n_samples = distances.shape[0]
        core_distances = np.full(n_samples, self.UNDEFINED)
        
        for i in range(n_samples):
            # Get distances to all neighbors within max_eps
            dists = distances[i]
            within_eps = dists[dists <= self.max_eps]
            
            if len(within_eps) >= self.min_pts:
                # Core distance is distance to min_pts-th nearest neighbor
                sorted_dists = np.sort(within_eps)
                core_distances[i] = sorted_dists[self.min_pts - 1]
        
        return core_distances
    
    def _process_point(
        self,
        point_idx: int,
        X: np.ndarray,
        distances: np.ndarray,
        core_distances: np.ndarray,
        processed: np.ndarray,
        ordering: List[int],
        reachability: np.ndarray,
        predecessor: np.ndarray
    ) -> None:
        """Process a point and expand its neighborhood."""
        
        # Priority queue: (reachability_dist, point_index)
        # Using list as heap with heapq
        seeds = []
        
        processed[point_idx] = True
        ordering.append(point_idx)
        
        if core_distances[point_idx] < self.UNDEFINED:
            # This point can be a core point - expand
            self._update_seeds(
                point_idx, distances, core_distances,
                processed, seeds, reachability, predecessor
            )
            
            while seeds:
                _, next_idx = heapq.heappop(seeds)
                
                if processed[next_idx]:
                    continue
                
                processed[next_idx] = True
                ordering.append(next_idx)
                
                if core_distances[next_idx] < self.UNDEFINED:
                    self._update_seeds(
                        next_idx, distances, core_distances,
                        processed, seeds, reachability, predecessor
                    )
    
    def _update_seeds(
        self,
        center_idx: int,
        distances: np.ndarray,
        core_distances: np.ndarray,
        processed: np.ndarray,
        seeds: List[Tuple[float, int]],
        reachability: np.ndarray,
        predecessor: np.ndarray
    ) -> None:
        """Update seed set with neighbors of center point."""
        
        core_dist = core_distances[center_idx]
        neighbors = np.where(distances[center_idx] <= self.max_eps)[0]
        
        for neighbor_idx in neighbors:
            if processed[neighbor_idx]:
                continue
            
            # Compute reachability distance
            dist = distances[center_idx, neighbor_idx]
            new_reach_dist = max(core_dist, dist)
            
            if new_reach_dist < reachability[neighbor_idx]:
                reachability[neighbor_idx] = new_reach_dist
                predecessor[neighbor_idx] = center_idx
                heapq.heappush(seeds, (new_reach_dist, neighbor_idx))
 
 
def extract_dbscan_clusters(
    optics_result: OPTICS,
    eps: float
) -> np.ndarray:
    """
    Extract DBSCAN-like clusters at a specific epsilon threshold.
    
    Points with reachability > eps start new clusters.
    """
    ordering = optics_result.ordering_
    reachability = optics_result.reachability_
    core_distances = optics_result.core_distances_
    
    n_samples = len(ordering)
    labels = np.full(n_samples, -1, dtype=int)  # -1 for noise
    cluster_id = -1
    
    for i, point_idx in enumerate(ordering):
        if reachability[i] > eps:
            # This point is not density-reachable at this eps
            if core_distances[point_idx] <= eps:
                # But it's a core point - start new cluster
                cluster_id += 1
                labels[point_idx] = cluster_id
            # else: it's noise
        else:
            # Density-reachable - add to current cluster
            labels[point_idx] = cluster_id
    
    return labels
 
 
# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    
    # Create clusters with different densities
    cluster1 = np.random.normal([0, 0], 0.3, (100, 2))  # Dense
    cluster2 = np.random.normal([4, 4], 1.0, (80, 2))   # Sparse
    cluster3 = np.random.normal([8, 0], 0.5, (60, 2))   # Medium
    noise = np.random.uniform(-2, 10, (20, 2))
    
    X = np.vstack([cluster1, cluster2, cluster3, noise])
    
    # Run OPTICS
    optics = OPTICS(max_eps=3.0, min_pts=5)
    optics.fit(X)
    
    print("OPTICS ordering computed")
    print(f"Reachability range: {optics.reachability_.min():.3f} to {optics.reachability_.max():.3f}")

Reachability Plots: Visualizing Cluster Structure

The reachability plot is OPTICS' signature visualization. It's a bar plot where:

X-axis: Points in OPTICS ordering (index 0 to n-1)
Y-axis: Reachability distance of each point

Interpreting the Reachability Plot:

The plot reveals cluster structure through its characteristic shape:

Valleys = Clusters: Sequences of low reachability values indicate dense, connected regions—clusters. Points within a valley are mutually reachable at low ε thresholds.

Peaks = Cluster Boundaries: Spikes in reachability indicate transitions between clusters. The height of the peak corresponds to the 'gap' between clusters—higher peaks mean more separation.

Plateau = Noise: Flat regions at high reachability values often represent noise or sparse, unstructured regions.

Reading the Plot

Imagine drawing a horizontal line at some height ε on the reachability plot. Every connected segment below the line (where the bars are shorter than ε) corresponds to a cluster. Moving the line up or down changes which clusters you see—lower lines find only the densest clusters, higher lines merge adjacent clusters.

Hierarchical Structure:

The reachability plot encodes a hierarchical clustering:

Nested Valleys: A cluster (valley) may contain sub-clusters (smaller valleys within it)
Multiple Scales: Different horizontal thresholds reveal different granularities
Dendrogram Relationship: The reachability plot is related to single-linkage dendrograms but with density-based semantics

Example Patterns:

Pattern	Meaning
Sharp V-shape	Well-separated cluster with clear boundaries
Broad U-shape	Loose cluster with fuzzy boundaries
Multiple nested Vs	Hierarchical sub-cluster structure
Saw-tooth pattern	Many small, similar-density clusters
Flat line at maximum	Noise or no structure at this scale

reachability_plot.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from typing import List, Tuple, Optional
 
def plot_reachability(
    reachability: np.ndarray,
    ordering: np.ndarray,
    labels: Optional[np.ndarray] = None,
    threshold: Optional[float] = None,
    title: str = "OPTICS Reachability Plot"
) -> plt.Figure:
    """
    Create an OPTICS reachability plot.
    
    Parameters
    ----------
    reachability : ndarray
        Reachability distances in ordering.
    ordering : ndarray
        OPTICS ordering of point indices.
    labels : ndarray, optional
        Cluster labels for coloring.
    threshold : float, optional
        Draw a horizontal line at this threshold.
    
    Returns
    -------
    fig : matplotlib.Figure
    """
    fig, ax = plt.subplots(figsize=(14, 6))
    
    n = len(reachability)
    
    # Replace infinities for plotting
    reach_plot = reachability.copy()
    max_finite = np.max(reach_plot[np.isfinite(reach_plot)])
    reach_plot[~np.isfinite(reach_plot)] = max_finite * 1.1
    
    if labels is not None:
        # Color by cluster
        unique_labels = np.unique(labels)
        colors = plt.cm.tab20(np.linspace(0, 1, len(unique_labels)))
        color_map = {label: colors[i] for i, label in enumerate(unique_labels)}
        
        bar_colors = [color_map[labels[ordering[i]]] for i in range(n)]
        ax.bar(range(n), reach_plot, width=1.0, color=bar_colors, edgecolor='none')
    else:
        ax.bar(range(n), reach_plot, width=1.0, color='steelblue', edgecolor='none')
    
    if threshold is not None:
        ax.axhline(y=threshold, color='red', linestyle='--', 
                   linewidth=2, label=f'ε = {threshold}')
        ax.legend()
    
    ax.set_xlabel('Points (OPTICS ordering)', fontsize=12)
    ax.set_ylabel('Reachability Distance', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.set_xlim(0, n)
    ax.set_ylim(0, max_finite * 1.15)
    
    plt.tight_layout()
    return fig
 
 
def annotate_clusters(
    ax: plt.Axes,
    reachability: np.ndarray,
    cluster_spans: List[Tuple[int, int, str]]
) -> None:
    """
    Add cluster annotations to a reachability plot.
    
    Parameters
    ----------
    ax : matplotlib Axes
        The plot axes.
    cluster_spans : list of (start, end, label)
        Cluster regions to annotate.
    """
    max_reach = np.max(reachability[np.isfinite(reachability)])
    
    for start, end, label in cluster_spans:
        # Draw bracket
        mid = (start + end) / 2
        ax.annotate(
            label,
            xy=(mid, 0),
            xytext=(mid, -max_reach * 0.15),
            ha='center',
            fontsize=10,
            fontweight='bold',
            arrowprops=dict(arrowstyle='-', color='gray')
        )
        
        # Draw span line
        ax.hlines(
            y=-max_reach * 0.08,
            xmin=start,
            xmax=end,
            color='gray',
            linewidth=2
        )
 
 
def interactive_threshold_analysis(
    reachability: np.ndarray,
    ordering: np.ndarray,
    core_distances: np.ndarray,
    thresholds: List[float]
) -> None:
    """
    Show cluster extraction at multiple thresholds.
    """
    n_thresholds = len(thresholds)
    fig, axes = plt.subplots(n_thresholds, 1, figsize=(14, 4 * n_thresholds))
    
    if n_thresholds == 1:
        axes = [axes]
    
    for ax, eps in zip(axes, thresholds):
        # Extract clusters at this threshold
        labels = np.full(len(ordering), -1, dtype=int)
        labels_ordered = np.full(len(ordering), -1, dtype=int)
        cluster_id = -1
        
        for i, point_idx in enumerate(ordering):
            if reachability[i] > eps:
                if core_distances[point_idx] <= eps:
                    cluster_id += 1
                    labels_ordered[i] = cluster_id
                    labels[point_idx] = cluster_id
            else:
                labels_ordered[i] = cluster_id
                labels[point_idx] = cluster_id
        
        # Plot
        reach_plot = reachability.copy()
        max_finite = np.max(reach_plot[np.isfinite(reach_plot)])
        reach_plot[~np.isfinite(reach_plot)] = max_finite * 1.1
        
        # Color by cluster
        n_clusters = cluster_id + 1
        colors = plt.cm.tab20(np.linspace(0, 1, max(n_clusters + 1, 2)))
        bar_colors = []
        for i in range(len(ordering)):
            if labels_ordered[i] == -1:
                bar_colors.append('gray')
            else:
                bar_colors.append(colors[labels_ordered[i] % 20])
        
        ax.bar(range(len(ordering)), reach_plot, width=1.0, 
               color=bar_colors, edgecolor='none')
        ax.axhline(y=eps, color='red', linestyle='--', linewidth=2)
        
        n_noise = np.sum(labels == -1)
        ax.set_title(f'ε = {eps:.2f}: {n_clusters} clusters, {n_noise} noise points')
        ax.set_ylabel('Reachability')
        ax.set_xlim(0, len(ordering))
    
    axes[-1].set_xlabel('OPTICS Ordering')
    plt.tight_layout()
    plt.show()

Cluster Extraction Methods

The OPTICS ordering is not a clustering per se—it's a data structure from which clusterings can be extracted. Several methods exist for extracting meaningful clusters:

Method 1: DBSCAN-Style Extraction (Threshold Method)

The simplest approach: pick a reachability threshold ε and treat all points with reachability ≤ ε as clustered:

Scan the ordering from left to right
When reachability[i] > ε and core_dist[i] ≤ ε: start new cluster
When reachability[i] > ε and core_dist[i] > ε: mark as noise
When reachability[i] ≤ ε: add to current cluster

This is equivalent to running DBSCAN with that ε value.

Advantages: Simple, interpretable, DBSCAN-compatible Disadvantages: Requires choosing a single threshold, doesn't capture hierarchical structure

Method 2: Xi Cluster Extraction

The Xi method automatically detects significant clusters by finding steep downward and upward gradients in the reachability plot:

Steep Down Area (SDA): A region where reachability is decreasing significantly—entering a cluster
Steep Up Area (SUA): A region where reachability is increasing significantly—leaving a cluster

Clusters are the regions between matching SDA-SUA pairs. The steepness threshold ξ (xi) controls what counts as 'significant':

$$\text{steep_down}(i) \Leftrightarrow \text{reach}(i+1) \leq \text{reach}(i) \cdot (1 - \xi)$$

$$\text{steep_up}(i) \Leftrightarrow \text{reach}(i+1) \geq \text{reach}(i) \cdot (1 + \xi) / (1 - \xi)$$

Advantages: Automatic, can extract hierarchical clusters Disadvantages: More complex, ξ parameter still needs tuning

Cluster Extraction Guidelines

•Use threshold method when you have a natural density scale from domain knowledge
•Use Xi method for automatic, parameter-free (mostly) cluster detection
•Try multiple thresholds to understand the hierarchical structure
•Match DBSCAN by using threshold = ε you would have used in DBSCAN
•Visual inspection of the reachability plot often suggests natural thresholds at 'valley floors'

The Xi Parameter

Common Xi values range from 0.01 to 0.1. Lower values (0.01-0.03) find more, smaller clusters. Higher values (0.05-0.1) find fewer, larger clusters. Xi = 0.05 is a reasonable starting point for many datasets.

OPTICS vs. DBSCAN: When to Use Which

While OPTICS generalizes DBSCAN, they serve different purposes and have different trade-offs:

OPTICS vs. DBSCAN Comparison
Criterion	DBSCAN	OPTICS
Primary output	Cluster labels	Ordering + reachability values
Density handling	Single density scale	All density scales
Parameters	ε, MinPts	ε_max, MinPts (+ extraction params)
Time complexity	O(n log n)	O(n log n) with same indexing
Space complexity	O(n)	O(n) for ordering + values
Cluster hierarchy	Not available	Fully captured
Interpretability	Direct cluster assignments	Requires extraction step
Parameter sensitivity	High (must choose ε)	Lower (ε_max is an upper bound)

Choose DBSCAN When:

•You know the appropriate density scale
•Clusters have similar densities
•You need direct cluster labels
•Simplicity is paramount
•You're doing quick exploration

Choose OPTICS When:

•Clusters have varying densities
•You need hierarchical structure
•Optimal ε is unknown
•You want to explore multiple scales
•Visual cluster analysis is important

Practical Workflow

A recommended workflow: (1) Run OPTICS to understand the density structure, (2) Examine reachability plot to identify natural density thresholds, (3) Either extract from OPTICS at chosen threshold or run DBSCAN with informed ε choice. OPTICS serves as an intelligent preprocessing step for DBSCAN parameter selection.

Complexity and Optimizations

Time Complexity Analysis:

OPTICS has similar complexity to DBSCAN:

Naive Implementation:

Core distance computation: O(n²) — computing MinPts-th nearest for each point
Main loop: O(n²) — each point may update all other points
Total: O(n²)

With Spatial Indexing (KD-tree/Ball-tree):

Core distance: O(n log n) — k-nearest queries
Neighborhood queries: O(n log n) average
Priority queue operations: O(n log n)
Total: O(n log n) average, O(n²) worst case

Space Complexity:

Original data: O(n·d)
Core distances: O(n)
Reachability: O(n)
Ordering: O(n)
Priority queue: O(n) worst case
Total: O(n·d) + working memory

Optimization Strategies

•Spatial Indexing: Use KD-trees (low dimension) or Ball-trees (higher dimension) for O(log n) neighbor queries
•Lazy Evaluation: Don't precompute core distances; compute on-demand during expansion
•ε_max Pruning: Smaller ε_max reduces the number of neighbors to consider
•Approximate Neighbors: Use LSH or other approximate methods for very large datasets
•Parallel OPTICS: The initial core distance computation is embarrassingly parallel
•Sample-based Approximation: For huge datasets, run OPTICS on a representative sample

ε_max Trade-offs

Choosing ε_max involves a trade-off: too small might miss sparse clusters, too large increases computation. A good strategy: set ε_max based on the k-distance plot (like DBSCAN's ε selection), where k = MinPts. The 'elbow' suggests a reasonable upper bound.

Summary: OPTICS Algorithm

OPTICS represents an elegant generalization of DBSCAN that addresses the fundamental limitation of fixed-density clustering. Let's consolidate the key concepts:

Key Takeaways

•Multi-Scale Clustering — OPTICS captures clustering structure at all density levels, not just one fixed ε.
•Core Distance — The minimum ε at which a point becomes core; measures local density.
•Reachability Distance — The minimum ε at which q is density-reachable from p; max of core distance and actual distance.
•Ordering Not Labels — OPTICS produces an ordered sequence of points, not direct cluster assignments.
•Reachability Plot — A powerful visualization where valleys represent clusters and peaks represent cluster boundaries.
•Flexible Extraction — Clusters can be extracted at any threshold or automatically using the Xi method.
•DBSCAN Generalization — Extracting at threshold ε gives results equivalent to DBSCAN at that ε.

What's Next:

While OPTICS advances beyond DBSCAN's single-scale limitation, it still requires parameter selection (MinPts, ε_max, extraction parameters). HDBSCAN (Hierarchical DBSCAN) takes the final step: automatically extracting an optimal flat clustering from the hierarchical structure, with minimal user intervention.

Page Complete

You now understand OPTICS' theoretical foundations, algorithmic mechanics, and practical applications. You can interpret reachability plots, extract clusters at various thresholds, and understand when OPTICS is preferable to DBSCAN. Next, we'll explore HDBSCAN—the state-of-the-art evolution of density-based clustering.

3 / 5

Loading learning content...

Machine LearningDensity-Based Clustering

Density-Based Clustering

LevelIntermediate

Duration120 mins

TopicDensity-Based Clustering

3 / 5

OPTICS Algorithm

Beyond Fixed Epsilon: The Multi-Scale Challenge

ε too small: Dense clusters are found, but sparser clusters shatter into noise
ε too large: Sparse clusters are found, but dense clusters merge inappropriately

What You Will Learn

Motivation and Key Insight

The Multi-Density Problem:

Consider a dataset with three clusters:

Cluster A: 1000 points in a 1-unit radius ball (very dense)
Cluster B: 200 points in a 5-unit radius ball (moderate density)
Cluster C: 50 points in a 10-unit radius ball (sparse)

With DBSCAN:

ε = 0.1: Only Cluster A is found; B and C are noise
ε = 0.5: A and B are found; C is noise; A might fragment
ε = 1.0: All three might merge into one cluster

No single ε captures the true three-cluster structure.

The OPTICS Insight

From Clustering to Ordering:

OPTICS fundamentally changes the output type:

DBSCAN	OPTICS
Input: ε, MinPts	Input: ε_max, MinPts
Output: Cluster labels	Output: Ordered list + reachability values
Fixed density threshold	Variable density thresholds
One clustering	Multiple clusterings extractable

Core Distance and Reachability Distance

OPTICS introduces two fundamental concepts that generalize DBSCAN's fixed-ε neighborhoods:

Core Distance:

The core distance of a point p with respect to MinPts, denoted $\text{core-dist}_{\text{MinPts}}(p)$, is the smallest ε value at which p would become a core point:

In other words, it's the distance to p's MinPts-th nearest neighbor. If p doesn't have MinPts neighbors within ε_max, its core distance is undefined.

Interpretation: Core distance tells us the 'local density' at point p. Small core distance means p is in a dense region (neighbors are close). Large core distance means p is in a sparse region.

Core Distance Example

If MinPts = 5 and point p's 5 nearest neighbors are at distances [0.1, 0.2, 0.3, 0.35, 0.5], then core-dist(p) = 0.5 (the 5th nearest distance). At any ε ≥ 0.5, p would be a core point.

Reachability Distance:

The reachability distance of point q with respect to point p is:

$$\text{reach-dist}{\text{MinPts}}(q, p) = \max(\text{core-dist}{\text{MinPts}}(p), d(p, q))$$

This is a key asymmetric measure. It's defined only when p has a defined core distance (p could be a core point at some ε ≤ ε_max).

Interpretation: The reachability distance answers: "What's the minimum ε at which q would be directly density-reachable from p?"

If d(p, q) > core-dist(p): The reachability distance is simply d(p, q)—we need ε that large to include q in p's neighborhood.
If d(p, q) ≤ core-dist(p): The reachability distance is core-dist(p)—even though q is close, p needs ε = core-dist(p) to become core.

Core and Reachability Distance Examples
Scenario	core-dist(p)	d(p,q)	reach-dist(q,p)	Explanation
Dense p, nearby q	0.3	0.2	0.3	q is closer than core threshold, limited by core-dist
Dense p, far q	0.3	0.8	0.8	q is farther than core threshold, limited by d(p,q)
Sparse p, nearby q	1.5	0.2	1.5	p needs large ε to be core, dominates reach-dist
Sparse p, far q	1.5	2.0	2.0	Both p's sparsity and q's distance are large

Why Reachability Distance Matters:

The reachability distance smooths out the 'density transition' between regions:

Within dense clusters: Reachability distances are small (limited by small core distances)
At cluster boundaries: Reachability distances spike (transitioning to sparse regions)
Between clusters: Reachability distances are large (sparse void between clusters)

This pattern creates the characteristic valleys and peaks in reachability plots that reveal cluster structure.

The OPTICS Algorithm

OPTICS produces an ordering of points such that spatially close points (at any density scale) appear near each other in the ordering. The algorithm maintains and outputs two arrays:

Ordered list: The sequence in which points are processed
Reachability values: The reachability distance at which each point was reached

High-Level Algorithm:

OPTICS(D, ε_max, MinPts):
    Compute core-distance for all points in D
    
    ordered_list ← []
    reachability ← {} // maps point → reachability distance
    processed ← {} // tracks which points are done
    
    for each unprocessed point p in D:
        mark p as processed
        append p to ordered_list
        reachability[p] ← INFINITY (or UNDEFINED)
        
        if core-dist(p) is defined:
            seeds ← empty priority queue (min-heap by reach-dist)
            update(seeds, p, processed)  // add p's neighbors
            
            while seeds is not empty:
                q ← extract-min(seeds)
                mark q as processed
                append q to ordered_list
                reachability[q] ← q's priority value
                
                if core-dist(q) is defined:
                    update(seeds, q, processed)
    
    return ordered_list, reachability

The Update Procedure:

update(seeds, p, processed):
    for each q in N_{ε_max}(p):
        if q is not processed:
            new_reach_dist ← reach-dist(q, p)
            
            if q is not in seeds:
                insert q into seeds with priority new_reach_dist
            else if new_reach_dist < current priority of q:
                decrease priority of q to new_reach_dist

Algorithm Properties:

Key Properties

•Density-First Expansion: The priority queue ensures we always process the point with smallest reachability distance next—staying within dense regions as long as possible.
•Lazy Cluster Discovery: Points from one cluster are fully processed before 'jumping' to another cluster, creating natural cluster groupings in the ordering.
•Captures All Scales: The reachability values encode clustering structure from the finest (densest) to coarsest (sparsest) levels.
•DBSCAN as Special Case: Extracting clusters at a fixed reachability threshold is equivalent to running DBSCAN at that ε.

optics.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
import numpy as np
import heapq
from typing import List, Tuple, Optional
from dataclasses import dataclass, field
 
@dataclass
class OPTICSResult:
    """Result of OPTICS algorithm."""
    ordering: np.ndarray           # Indices in cluster order
    reachability: np.ndarray       # Reachability distance for each point
    core_distances: np.ndarray     # Core distance for each point
    predecessor: np.ndarray        # Which point reached each point
 
 
class OPTICS:
    """
    OPTICS: Ordering Points To Identify the Clustering Structure
    
    Produces an ordering of points that reveals clustering structure
    at multiple density levels.
    
    Parameters
    ----------
    max_eps : float
        Maximum neighborhood radius. Points farther than this are never
        considered neighbors.
    min_pts : int
        Minimum points for core point definition.
    
    Attributes
    ----------
    ordering_ : ndarray
        The OPTICS ordering of points.
    reachability_ : ndarray
        Reachability distance for each point in ordering.
    """
    
    UNDEFINED = float('inf')
    
    def __init__(self, max_eps: float = float('inf'), min_pts: int = 5):
        self.max_eps = max_eps
        self.min_pts = min_pts
        self.ordering_ = None
        self.reachability_ = None
        self.core_distances_ = None
    
    def fit(self, X: np.ndarray) -> 'OPTICS':
        """
        Compute OPTICS ordering and reachability.
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Input data.
        
        Returns
        -------
        self
        """
        n_samples = X.shape[0]
        
        # Compute all pairwise distances (for simplicity)
        # In production, use spatial indexing
        distances = self._compute_distances(X)
        
        # Compute core distances
        core_distances = self._compute_core_distances(distances)
        
        # OPTICS main loop
        processed = np.zeros(n_samples, dtype=bool)
        ordering = []
        reachability = np.full(n_samples, self.UNDEFINED)
        predecessor = np.full(n_samples, -1, dtype=int)
        
        for i in range(n_samples):
            if processed[i]:
                continue
            
            # Start processing from point i
            self._process_point(
                i, X, distances, core_distances,
                processed, ordering, reachability, predecessor
            )
        
        # Convert ordering to array form
        ordering = np.array(ordering)
        
        # Reachability in ordering sequence
        reachability_ordered = reachability[ordering]
        
        self.ordering_ = ordering
        self.reachability_ = reachability_ordered
        self.core_distances_ = core_distances
        
        return self
    
    def _compute_distances(self, X: np.ndarray) -> np.ndarray:
        """Compute pairwise Euclidean distances."""
        sq_norms = np.sum(X ** 2, axis=1)
        distances = np.sqrt(
            sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 
            2 * X @ X.T
        )
        return np.maximum(distances, 0)  # Handle numerical issues
    
    def _compute_core_distances(self, distances: np.ndarray) -> np.ndarray:
        """Compute core distance for each point."""
        n_samples = distances.shape[0]
        core_distances = np.full(n_samples, self.UNDEFINED)
        
        for i in range(n_samples):
            # Get distances to all neighbors within max_eps
            dists = distances[i]
            within_eps = dists[dists <= self.max_eps]
            
            if len(within_eps) >= self.min_pts:
                # Core distance is distance to min_pts-th nearest neighbor
                sorted_dists = np.sort(within_eps)
                core_distances[i] = sorted_dists[self.min_pts - 1]
        
        return core_distances
    
    def _process_point(
        self,
        point_idx: int,
        X: np.ndarray,
        distances: np.ndarray,
        core_distances: np.ndarray,
        processed: np.ndarray,
        ordering: List[int],
        reachability: np.ndarray,
        predecessor: np.ndarray
    ) -> None:
        """Process a point and expand its neighborhood."""
        
        # Priority queue: (reachability_dist, point_index)
        # Using list as heap with heapq
        seeds = []
        
        processed[point_idx] = True
        ordering.append(point_idx)
        
        if core_distances[point_idx] < self.UNDEFINED:
            # This point can be a core point - expand
            self._update_seeds(
                point_idx, distances, core_distances,
                processed, seeds, reachability, predecessor
            )
            
            while seeds:
                _, next_idx = heapq.heappop(seeds)
                
                if processed[next_idx]:
                    continue
                
                processed[next_idx] = True
                ordering.append(next_idx)
                
                if core_distances[next_idx] < self.UNDEFINED:
                    self._update_seeds(
                        next_idx, distances, core_distances,
                        processed, seeds, reachability, predecessor
                    )
    
    def _update_seeds(
        self,
        center_idx: int,
        distances: np.ndarray,
        core_distances: np.ndarray,
        processed: np.ndarray,
        seeds: List[Tuple[float, int]],
        reachability: np.ndarray,
        predecessor: np.ndarray
    ) -> None:
        """Update seed set with neighbors of center point."""
        
        core_dist = core_distances[center_idx]
        neighbors = np.where(distances[center_idx] <= self.max_eps)[0]
        
        for neighbor_idx in neighbors:
            if processed[neighbor_idx]:
                continue
            
            # Compute reachability distance
            dist = distances[center_idx, neighbor_idx]
            new_reach_dist = max(core_dist, dist)
            
            if new_reach_dist < reachability[neighbor_idx]:
                reachability[neighbor_idx] = new_reach_dist
                predecessor[neighbor_idx] = center_idx
                heapq.heappush(seeds, (new_reach_dist, neighbor_idx))
 
 
def extract_dbscan_clusters(
    optics_result: OPTICS,
    eps: float
) -> np.ndarray:
    """
    Extract DBSCAN-like clusters at a specific epsilon threshold.
    
    Points with reachability > eps start new clusters.
    """
    ordering = optics_result.ordering_
    reachability = optics_result.reachability_
    core_distances = optics_result.core_distances_
    
    n_samples = len(ordering)
    labels = np.full(n_samples, -1, dtype=int)  # -1 for noise
    cluster_id = -1
    
    for i, point_idx in enumerate(ordering):
        if reachability[i] > eps:
            # This point is not density-reachable at this eps
            if core_distances[point_idx] <= eps:
                # But it's a core point - start new cluster
                cluster_id += 1
                labels[point_idx] = cluster_id
            # else: it's noise
        else:
            # Density-reachable - add to current cluster
            labels[point_idx] = cluster_id
    
    return labels
 
 
# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    
    # Create clusters with different densities
    cluster1 = np.random.normal([0, 0], 0.3, (100, 2))  # Dense
    cluster2 = np.random.normal([4, 4], 1.0, (80, 2))   # Sparse
    cluster3 = np.random.normal([8, 0], 0.5, (60, 2))   # Medium
    noise = np.random.uniform(-2, 10, (20, 2))
    
    X = np.vstack([cluster1, cluster2, cluster3, noise])
    
    # Run OPTICS
    optics = OPTICS(max_eps=3.0, min_pts=5)
    optics.fit(X)
    
    print("OPTICS ordering computed")
    print(f"Reachability range: {optics.reachability_.min():.3f} to {optics.reachability_.max():.3f}")

Reachability Plots: Visualizing Cluster Structure

The reachability plot is OPTICS' signature visualization. It's a bar plot where:

X-axis: Points in OPTICS ordering (index 0 to n-1)
Y-axis: Reachability distance of each point

Interpreting the Reachability Plot:

The plot reveals cluster structure through its characteristic shape:

Valleys = Clusters: Sequences of low reachability values indicate dense, connected regions—clusters. Points within a valley are mutually reachable at low ε thresholds.

Peaks = Cluster Boundaries: Spikes in reachability indicate transitions between clusters. The height of the peak corresponds to the 'gap' between clusters—higher peaks mean more separation.

Plateau = Noise: Flat regions at high reachability values often represent noise or sparse, unstructured regions.

Reading the Plot

Hierarchical Structure:

The reachability plot encodes a hierarchical clustering:

Nested Valleys: A cluster (valley) may contain sub-clusters (smaller valleys within it)
Multiple Scales: Different horizontal thresholds reveal different granularities
Dendrogram Relationship: The reachability plot is related to single-linkage dendrograms but with density-based semantics

Example Patterns:

Pattern	Meaning
Sharp V-shape	Well-separated cluster with clear boundaries
Broad U-shape	Loose cluster with fuzzy boundaries
Multiple nested Vs	Hierarchical sub-cluster structure
Saw-tooth pattern	Many small, similar-density clusters
Flat line at maximum	Noise or no structure at this scale

reachability_plot.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from typing import List, Tuple, Optional
 
def plot_reachability(
    reachability: np.ndarray,
    ordering: np.ndarray,
    labels: Optional[np.ndarray] = None,
    threshold: Optional[float] = None,
    title: str = "OPTICS Reachability Plot"
) -> plt.Figure:
    """
    Create an OPTICS reachability plot.
    
    Parameters
    ----------
    reachability : ndarray
        Reachability distances in ordering.
    ordering : ndarray
        OPTICS ordering of point indices.
    labels : ndarray, optional
        Cluster labels for coloring.
    threshold : float, optional
        Draw a horizontal line at this threshold.
    
    Returns
    -------
    fig : matplotlib.Figure
    """
    fig, ax = plt.subplots(figsize=(14, 6))
    
    n = len(reachability)
    
    # Replace infinities for plotting
    reach_plot = reachability.copy()
    max_finite = np.max(reach_plot[np.isfinite(reach_plot)])
    reach_plot[~np.isfinite(reach_plot)] = max_finite * 1.1
    
    if labels is not None:
        # Color by cluster
        unique_labels = np.unique(labels)
        colors = plt.cm.tab20(np.linspace(0, 1, len(unique_labels)))
        color_map = {label: colors[i] for i, label in enumerate(unique_labels)}
        
        bar_colors = [color_map[labels[ordering[i]]] for i in range(n)]
        ax.bar(range(n), reach_plot, width=1.0, color=bar_colors, edgecolor='none')
    else:
        ax.bar(range(n), reach_plot, width=1.0, color='steelblue', edgecolor='none')
    
    if threshold is not None:
        ax.axhline(y=threshold, color='red', linestyle='--', 
                   linewidth=2, label=f'ε = {threshold}')
        ax.legend()
    
    ax.set_xlabel('Points (OPTICS ordering)', fontsize=12)
    ax.set_ylabel('Reachability Distance', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.set_xlim(0, n)
    ax.set_ylim(0, max_finite * 1.15)
    
    plt.tight_layout()
    return fig
 
 
def annotate_clusters(
    ax: plt.Axes,
    reachability: np.ndarray,
    cluster_spans: List[Tuple[int, int, str]]
) -> None:
    """
    Add cluster annotations to a reachability plot.
    
    Parameters
    ----------
    ax : matplotlib Axes
        The plot axes.
    cluster_spans : list of (start, end, label)
        Cluster regions to annotate.
    """
    max_reach = np.max(reachability[np.isfinite(reachability)])
    
    for start, end, label in cluster_spans:
        # Draw bracket
        mid = (start + end) / 2
        ax.annotate(
            label,
            xy=(mid, 0),
            xytext=(mid, -max_reach * 0.15),
            ha='center',
            fontsize=10,
            fontweight='bold',
            arrowprops=dict(arrowstyle='-', color='gray')
        )
        
        # Draw span line
        ax.hlines(
            y=-max_reach * 0.08,
            xmin=start,
            xmax=end,
            color='gray',
            linewidth=2
        )
 
 
def interactive_threshold_analysis(
    reachability: np.ndarray,
    ordering: np.ndarray,
    core_distances: np.ndarray,
    thresholds: List[float]
) -> None:
    """
    Show cluster extraction at multiple thresholds.
    """
    n_thresholds = len(thresholds)
    fig, axes = plt.subplots(n_thresholds, 1, figsize=(14, 4 * n_thresholds))
    
    if n_thresholds == 1:
        axes = [axes]
    
    for ax, eps in zip(axes, thresholds):
        # Extract clusters at this threshold
        labels = np.full(len(ordering), -1, dtype=int)
        labels_ordered = np.full(len(ordering), -1, dtype=int)
        cluster_id = -1
        
        for i, point_idx in enumerate(ordering):
            if reachability[i] > eps:
                if core_distances[point_idx] <= eps:
                    cluster_id += 1
                    labels_ordered[i] = cluster_id
                    labels[point_idx] = cluster_id
            else:
                labels_ordered[i] = cluster_id
                labels[point_idx] = cluster_id
        
        # Plot
        reach_plot = reachability.copy()
        max_finite = np.max(reach_plot[np.isfinite(reach_plot)])
        reach_plot[~np.isfinite(reach_plot)] = max_finite * 1.1
        
        # Color by cluster
        n_clusters = cluster_id + 1
        colors = plt.cm.tab20(np.linspace(0, 1, max(n_clusters + 1, 2)))
        bar_colors = []
        for i in range(len(ordering)):
            if labels_ordered[i] == -1:
                bar_colors.append('gray')
            else:
                bar_colors.append(colors[labels_ordered[i] % 20])
        
        ax.bar(range(len(ordering)), reach_plot, width=1.0, 
               color=bar_colors, edgecolor='none')
        ax.axhline(y=eps, color='red', linestyle='--', linewidth=2)
        
        n_noise = np.sum(labels == -1)
        ax.set_title(f'ε = {eps:.2f}: {n_clusters} clusters, {n_noise} noise points')
        ax.set_ylabel('Reachability')
        ax.set_xlim(0, len(ordering))
    
    axes[-1].set_xlabel('OPTICS Ordering')
    plt.tight_layout()
    plt.show()

Cluster Extraction Methods

The OPTICS ordering is not a clustering per se—it's a data structure from which clusterings can be extracted. Several methods exist for extracting meaningful clusters:

Method 1: DBSCAN-Style Extraction (Threshold Method)

The simplest approach: pick a reachability threshold ε and treat all points with reachability ≤ ε as clustered:

Scan the ordering from left to right
When reachability[i] > ε and core_dist[i] ≤ ε: start new cluster
When reachability[i] > ε and core_dist[i] > ε: mark as noise
When reachability[i] ≤ ε: add to current cluster

This is equivalent to running DBSCAN with that ε value.

Advantages: Simple, interpretable, DBSCAN-compatible Disadvantages: Requires choosing a single threshold, doesn't capture hierarchical structure

Method 2: Xi Cluster Extraction

The Xi method automatically detects significant clusters by finding steep downward and upward gradients in the reachability plot:

Steep Down Area (SDA): A region where reachability is decreasing significantly—entering a cluster
Steep Up Area (SUA): A region where reachability is increasing significantly—leaving a cluster

Clusters are the regions between matching SDA-SUA pairs. The steepness threshold ξ (xi) controls what counts as 'significant':

$$\text{steep_down}(i) \Leftrightarrow \text{reach}(i+1) \leq \text{reach}(i) \cdot (1 - \xi)$$

$$\text{steep_up}(i) \Leftrightarrow \text{reach}(i+1) \geq \text{reach}(i) \cdot (1 + \xi) / (1 - \xi)$$

Advantages: Automatic, can extract hierarchical clusters Disadvantages: More complex, ξ parameter still needs tuning

Cluster Extraction Guidelines

•Use threshold method when you have a natural density scale from domain knowledge
•Use Xi method for automatic, parameter-free (mostly) cluster detection
•Try multiple thresholds to understand the hierarchical structure
•Match DBSCAN by using threshold = ε you would have used in DBSCAN
•Visual inspection of the reachability plot often suggests natural thresholds at 'valley floors'

The Xi Parameter

OPTICS vs. DBSCAN: When to Use Which

While OPTICS generalizes DBSCAN, they serve different purposes and have different trade-offs:

OPTICS vs. DBSCAN Comparison
Criterion	DBSCAN	OPTICS
Primary output	Cluster labels	Ordering + reachability values
Density handling	Single density scale	All density scales
Parameters	ε, MinPts	ε_max, MinPts (+ extraction params)
Time complexity	O(n log n)	O(n log n) with same indexing
Space complexity	O(n)	O(n) for ordering + values
Cluster hierarchy	Not available	Fully captured
Interpretability	Direct cluster assignments	Requires extraction step
Parameter sensitivity	High (must choose ε)	Lower (ε_max is an upper bound)

Choose DBSCAN When:

•You know the appropriate density scale
•Clusters have similar densities
•You need direct cluster labels
•Simplicity is paramount
•You're doing quick exploration

Choose OPTICS When:

•Clusters have varying densities
•You need hierarchical structure
•Optimal ε is unknown
•You want to explore multiple scales
•Visual cluster analysis is important

Practical Workflow

Complexity and Optimizations

Time Complexity Analysis:

OPTICS has similar complexity to DBSCAN:

Naive Implementation:

Core distance computation: O(n²) — computing MinPts-th nearest for each point
Main loop: O(n²) — each point may update all other points
Total: O(n²)

With Spatial Indexing (KD-tree/Ball-tree):

Core distance: O(n log n) — k-nearest queries
Neighborhood queries: O(n log n) average
Priority queue operations: O(n log n)
Total: O(n log n) average, O(n²) worst case

Space Complexity:

Original data: O(n·d)
Core distances: O(n)
Reachability: O(n)
Ordering: O(n)
Priority queue: O(n) worst case
Total: O(n·d) + working memory

Optimization Strategies

•Spatial Indexing: Use KD-trees (low dimension) or Ball-trees (higher dimension) for O(log n) neighbor queries
•Lazy Evaluation: Don't precompute core distances; compute on-demand during expansion
•ε_max Pruning: Smaller ε_max reduces the number of neighbors to consider
•Approximate Neighbors: Use LSH or other approximate methods for very large datasets
•Parallel OPTICS: The initial core distance computation is embarrassingly parallel
•Sample-based Approximation: For huge datasets, run OPTICS on a representative sample

ε_max Trade-offs

Summary: OPTICS Algorithm

OPTICS represents an elegant generalization of DBSCAN that addresses the fundamental limitation of fixed-density clustering. Let's consolidate the key concepts:

Key Takeaways

•Multi-Scale Clustering — OPTICS captures clustering structure at all density levels, not just one fixed ε.
•Core Distance — The minimum ε at which a point becomes core; measures local density.
•Reachability Distance — The minimum ε at which q is density-reachable from p; max of core distance and actual distance.
•Ordering Not Labels — OPTICS produces an ordered sequence of points, not direct cluster assignments.
•Reachability Plot — A powerful visualization where valleys represent clusters and peaks represent cluster boundaries.
•Flexible Extraction — Clusters can be extracted at any threshold or automatically using the Xi method.
•DBSCAN Generalization — Extracting at threshold ε gives results equivalent to DBSCAN at that ε.

What's Next:

Page Complete

3 / 5