Loading content...
Clustering is a fundamental unsupervised learning technique that groups similar data points together without predefined labels. Among various clustering approaches, density-based clustering stands out for its ability to discover clusters of arbitrary shapes and automatically identify outliers in the data.
The core idea behind density-based clustering is elegantly simple: clusters are regions of high point density separated by regions of low density. Unlike centroid-based methods (like K-Means), this approach doesn't require specifying the number of clusters beforehand and can naturally detect irregularly shaped clusters.
Core Points: A point is considered a core point if it has at least min_neighbors other points within a distance of epsilon (including itself in some definitions). Core points are the foundation of dense regions.
Neighborhood: The epsilon-neighborhood of a point p consists of all points within Euclidean distance epsilon from p:
$$N_{\epsilon}(p) = {q \in X : |p - q|_2 \leq \epsilon}$$
Density Reachability: A point q is density-reachable from a core point p if there exists a chain of core points connecting them, where each consecutive pair is within epsilon distance.
Noise Points: Points that are not core points and are not density-reachable from any core point are classified as noise (outliers) and assigned the label -1.
Identify Core Points: For each point, count neighbors within epsilon distance. Points with at least min_neighbors neighbors become core points.
Form Clusters: Starting from an unvisited core point, recursively expand the cluster by including all density-reachable points. All connected core points and their border points form a single cluster.
Label Noise: Any remaining points not assigned to a cluster are marked as noise with label -1.
Write a Python function density_cluster that implements density-based spatial clustering. The function should:
X of shape (n_samples, n_features) containing the data pointsepsilon, the maximum distance for points to be considered neighborsmin_neighbors, the minimum number of points required to form a dense coreCluster labels should be assigned sequentially starting from 0 in the order clusters are discovered.
X = np.array([[1.0, 1.0], [1.5, 1.5], [2.0, 1.0], [8.0, 8.0], [8.5, 8.5], [9.0, 8.0]])
epsilon = 2.0
min_neighbors = 2[0, 0, 0, 1, 1, 1]The dataset contains 6 points forming two distinct spatial groups.
Cluster 0 (first group): • Points [1.0, 1.0], [1.5, 1.5], and [2.0, 1.0] are close together • Distance between [1.0, 1.0] and [1.5, 1.5] ≈ 0.71 < 2.0 ✓ • Distance between [1.0, 1.0] and [2.0, 1.0] = 1.0 < 2.0 ✓ • Each point has at least 2 neighbors within epsilon=2.0, so all are core points • They form a connected dense region → Cluster 0
Cluster 1 (second group): • Points [8.0, 8.0], [8.5, 8.5], and [9.0, 8.0] form another tight group • All pairwise distances are well under epsilon=2.0 • They form another connected dense region → Cluster 1
Separation Check: • Distance between [2.0, 1.0] and [8.0, 8.0] ≈ 9.22 >> 2.0 • The two groups are clearly separated with no density connection
No points are noise since all belong to dense regions.
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [10.0, 10.0], [100.0, 100.0]])
epsilon = 2.0
min_neighbors = 2[0, 0, 0, -1, -1]This example demonstrates noise detection capability.
Cluster 0: • Points [0.0, 0.0], [0.5, 0.5], and [1.0, 0.0] are clustered near the origin • All pairwise distances are < 2.0, and each has at least 2 neighbors • These points form a dense region → Cluster 0
Noise Points (label = -1): • Point [10.0, 10.0] is isolated—no other points within epsilon=2.0 • Point [100.0, 100.0] is extremely isolated in the feature space • Neither point has enough neighbors to be a core point • Neither is reachable from the dense cluster • Both are classified as noise → Label -1
This demonstrates the algorithm's inherent outlier detection capability.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [5.0, 5.0], [5.5, 5.0], [5.0, 5.5], [10.0, 0.0], [10.5, 0.0], [10.0, 0.5]])
epsilon = 1.0
min_neighbors = 2[0, 0, 0, 1, 1, 1, 2, 2, 2]This example shows discovery of multiple well-separated clusters.
Cluster 0 (near origin): • Points [0.0, 0.0], [0.5, 0.0], [0.0, 0.5] form a tight triangle • All pairwise distances ≤ 0.71 < 1.0 • Dense region with 3 mutually reachable points → Cluster 0
Cluster 1 (center region): • Points [5.0, 5.0], [5.5, 5.0], [5.0, 5.5] form another tight group • Positioned around coordinates (5, 5) • Dense region → Cluster 1
Cluster 2 (right side): • Points [10.0, 0.0], [10.5, 0.0], [10.0, 0.5] form the third group • Located along the x-axis around x=10 • Dense region → Cluster 2
With epsilon=1.0 (tighter threshold), the algorithm successfully identifies 3 distinct clusters. All points are core points as each has exactly 2 neighbors within the epsilon distance.
Constraints