At the heart of every clustering algorithm lies a deceptively simple question: How similar are two data points? This question seems trivial—we can often tell at a glance whether two objects are alike. But translating human intuition about similarity into mathematical precision reveals surprising depth and complexity.
The choice of similarity or distance measure is arguably the most important decision in clustering. Two clustering runs on identical data with identical algorithms can produce completely different results based solely on the distance metric. A metric that works beautifully for images may fail catastrophically for text; a measure ideal for low dimensions may become meaningless in high dimensions.
This page equips you with a rigorous understanding of distance and similarity: the mathematical properties they must satisfy, the most important metrics for different data types, and the crucial considerations for choosing the right measure for your problem.
By the end of this page, you will understand the mathematical properties that define metrics and similarity measures. You'll master the Minkowski family of distances, including Euclidean and Manhattan distances. You'll learn specialized measures for different data types, understand the curse of dimensionality's effect on distances, and develop intuition for choosing appropriate measures.
Before diving into specific measures, we must distinguish between two related but fundamentally different concepts: distance and similarity. While they're often used interchangeably in casual discussion, they have distinct mathematical properties.
Distance (Dissimilarity):
A distance function $d: X \times X \rightarrow \mathbb{R}$ quantifies how different two objects are. Higher values mean objects are more dissimilar. Distances are often unbounded above: there is no universal maximum (though some, such as the Jaccard distance, are bounded).
Similarity:
A similarity function $s: X \times X \rightarrow \mathbb{R}$ quantifies how alike two objects are. Higher values mean objects are more similar. Similarities are often bounded (e.g., between 0 and 1).
The Relationship:
Distance and similarity are inversely related. Given a distance $d$, common transformations to similarity include:
$$s = \frac{1}{1 + d} \quad \text{(bounded between 0 and 1)}$$
$$s = e^{-d} \quad \text{(exponential decay)}$$
$$s = e^{-d^2/2\sigma^2} \quad \text{(Gaussian kernel)}$$
Conversely, given a bounded similarity in $[0, 1]$:
$$d = 1 - s \quad \text{(linear transformation)}$$
$$d = -\log(s) \quad \text{(logarithmic, for positive similarities)}$$
The transformation matters! Different transformations from distance to similarity can yield different clustering results because they change the relative emphasis on near vs. far points. The Gaussian kernel, for instance, maps all sufficiently distant points to near-zero similarity, treating them as almost equally dissimilar, while $s = 1/(1+d)$ continues to distinguish them.
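To make that difference concrete, here is a minimal sketch (plain NumPy, with an assumed bandwidth $\sigma = 1$) that applies the three distance-to-similarity transformations to the same set of distances:

```python
import numpy as np

# A few illustrative distance values
d = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 10.0])

sigma = 1.0  # assumed bandwidth for the Gaussian kernel

s_inverse = 1.0 / (1.0 + d)                # bounded in (0, 1]
s_exp = np.exp(-d)                         # exponential decay
s_gauss = np.exp(-d**2 / (2 * sigma**2))   # Gaussian kernel

for row in zip(d, s_inverse, s_exp, s_gauss):
    print("d={:5.1f}  1/(1+d)={:.3f}  exp(-d)={:.3f}  gaussian={:.3f}".format(*row))

# The Gaussian kernel sends every distance beyond a few sigma to ~0,
# so "moderately far" and "very far" points become nearly indistinguishable,
# while 1/(1+d) still separates them.
```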
A metric (or distance function) on a set $X$ is a function $d: X \times X \rightarrow \mathbb{R}$ that satisfies four fundamental properties. Understanding these properties is crucial because they guarantee that our intuitive notion of "distance" behaves consistently.
The Four Axioms of a Metric:
For all $x, y, z \in X$:
Non-negativity: $d(x, y) \geq 0$
Identity of Indiscernibles: $d(x, y) = 0 \Leftrightarrow x = y$
Symmetry: $d(x, y) = d(y, x)$
Triangle Inequality: $d(x, z) \leq d(x, y) + d(y, z)$
| Axiom | Intuitive Meaning | Clustering Implication |
|---|---|---|
| Non-negativity | Distance is always positive or zero | We can meaningfully compare distances: A is 'closer' than B |
| Identity | Only identical points have zero distance | Duplicate detection works; cluster centers are well-defined |
| Symmetry | Direction doesn't matter | If A is similar to B, B is similar to A; clusters are symmetric |
| Triangle Inequality | No shortcuts through intermediate points | Enables efficient algorithms (pruning), hierarchical structures |
Relaxed Notions:
Not all useful measures satisfy all four axioms:
Pseudometric: Violates identity (distinct points can have zero distance). Example: distance between equivalence classes.
Quasimetric: Violates symmetry ($d(x,y) \neq d(y,x)$). Example: travel time with one-way streets.
Semimetric: Violates the triangle inequality. Example: squared Euclidean distance, where for points 0, 1, 2 on a line, $d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2$.
Clustering algorithms often work with measures that aren't true metrics. This is fine, but you should be aware of which guarantees you're losing and how that affects algorithm behavior.
The triangle inequality is particularly important for computational efficiency. Many fast nearest-neighbor algorithms (ball trees, KD-trees, cover trees) rely on it for pruning. If your distance measure violates the triangle inequality, these speedups don't apply, and you may need brute-force O(n²) computations.
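As an illustration of that point, here is a minimal sketch using scikit-learn's BallTree, a structure whose query pruning depends on the triangle inequality; the data and parameters are made up for the example:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))  # synthetic data for illustration

# BallTree groups points into nested balls and uses the triangle inequality
# to discard whole balls that cannot contain a query's nearest neighbors,
# so most queries can avoid a brute-force scan over all points.
tree = BallTree(X, metric="euclidean")

dist, idx = tree.query(X[:3], k=5)  # 5 nearest neighbors of the first 3 points
print(idx)   # neighbor indices (each point is its own nearest neighbor)
print(dist)  # corresponding distances, sorted ascending
```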
The most widely used distances for numerical data belong to the Minkowski family, parameterized by a single value $p \geq 1$. For vectors $x, y \in \mathbb{R}^d$:
$$d_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$
This elegant formulation unifies several important distances under a single framework.
Geometric Interpretation:
Different values of $p$ define different shapes for the "unit ball" (the set of all points at distance 1 from the origin):
- $p = 1$ (Manhattan): a diamond, i.e., a square rotated 45°
- $p = 2$ (Euclidean): a circle (a sphere in higher dimensions)
- $p = \infty$ (Chebyshev): an axis-aligned square (a hypercube in higher dimensions)
This geometric difference has profound implications for clustering. Euclidean distance treats all directions equally, while Manhattan distance emphasizes axis-aligned differences.
When to Use Each:
| Distance | Best For | Rationale |
|---|---|---|
| Manhattan ($p=1$) | High-dimensional data, sparse data | Less sensitive to outliers; differences don't get squared |
| Euclidean ($p=2$) | Low-dimensional, isotropic data | Natural geometric interpretation; rotation-invariant |
| Chebyshev ($p=\infty$) | Worst-case analysis, chess problems | When any single large difference is disqualifying |
| General $p$ | Empirically tuned | Sometimes intermediate values (e.g., $p=1.5$) work best |
```python
import numpy as np
from scipy.spatial.distance import minkowski, euclidean, cityblock, chebyshev

# Example: two 5-dimensional points
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Minkowski distances with different p values
print("Manhattan (L1):", cityblock(x, y))        # p=1: |1-5| + |2-4| + |3-3| + |4-2| + |5-1| = 12
print("Euclidean (L2):", euclidean(x, y))        # p=2: sqrt(4² + 2² + 0² + 2² + 4²) ≈ 6.32
print("Chebyshev (L∞):", chebyshev(x, y))        # p=∞: max(4, 2, 0, 2, 4) = 4
print("Minkowski (p=3):", minkowski(x, y, p=3))  # general case: (4³ + 2³ + 0³ + 2³ + 4³)^(1/3) ≈ 5.24

# Observe: L1 >= L2 >= L∞ always holds.
# The corresponding norm inequality is: ||x||_∞ <= ||x||_2 <= ||x||_1
```

For many applications, the magnitude of vectors is irrelevant—only their direction matters. Cosine similarity measures the angle between vectors, making it ideal for comparing documents, user preferences, or any data where scale is arbitrary.
Definition:
$$\cos(x, y) = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2} = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \cdot \sqrt{\sum_{i=1}^{d} y_i^2}}$$
This measures the cosine of the angle $\theta$ between vectors:
- $\cos(x, y) = 1$: the vectors point in exactly the same direction
- $\cos(x, y) = 0$: the vectors are orthogonal (no shared direction)
- $\cos(x, y) = -1$: the vectors point in opposite directions
Converting to Distance:
Cosine similarity ranges from -1 to 1 (for general vectors) or 0 to 1 (for non-negative vectors). Common distance transformations:
$$d_{\text{cosine}}(x, y) = 1 - \cos(x, y) \quad \text{(cosine distance, range [0, 2])}$$
$$d_{\text{angular}}(x, y) = \frac{\arccos(\cos(x, y))}{\pi} \quad \text{(angular distance, range [0, 1])}$$
Document similarity is the classic use case: a document that mentions 'machine learning' 100 times isn't necessarily more relevant than one that mentions it 10 times—the relative importance of terms matters, not absolute counts. Similarly, user rating patterns may matter more than whether someone rates on a 1-5 or 1-10 scale.
Relationship to Euclidean Distance:
For L2-normalized vectors (where $\|x\|_2 = \|y\|_2 = 1$), there's a beautiful relationship:
$$\|x - y\|_2^2 = 2(1 - \cos(x, y))$$
This means that for normalized data, Euclidean distance and cosine distance are monotonically related: any clustering algorithm that depends only on the ordering of distances gives the same results with either measure once the data is normalized.
Important Caveat:
Cosine distance is not a metric: it violates the triangle inequality, and any two distinct vectors pointing in the same direction have zero distance, violating the identity of indiscernibles. This means:
- Data structures and pruning tricks that rely on the triangle inequality (ball trees, KD-trees, some accelerated k-means variants) are not guaranteed to return correct results with cosine distance.
- If you need a measure of direction that does satisfy the triangle inequality, use the angular distance defined above.
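Three unit vectors in the plane at 0°, 45°, and 90° are enough to exhibit the violation; here is a minimal sketch with plain NumPy:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 0.0])                                # 0 degrees
y = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])    # 45 degrees
z = np.array([0.0, 1.0])                                # 90 degrees

d_xz = cosine_distance(x, z)                             # 1 - cos(90°) = 1.0
d_via_y = cosine_distance(x, y) + cosine_distance(y, z)  # 2 * (1 - cos(45°)) ≈ 0.59

print(f"d(x, z)           = {d_xz:.3f}")
print(f"d(x, y) + d(y, z) = {d_via_y:.3f}")
print("Triangle inequality holds:", d_xz <= d_via_y)     # False: the 'detour' through y is shorter
```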
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import normalize

# Document term vectors (word counts)
doc1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])   # Document 1
doc2 = np.array([6, 4, 0, 10, 0, 0, 0, 4, 0, 0])  # Document 2 (same words, 2x counts)
doc3 = np.array([0, 0, 4, 0, 5, 2, 3, 0, 1, 2])   # Document 3 (different words)

docs = np.array([doc1, doc2, doc3])

# Euclidean distance: doc1 and doc2 are far apart despite identical word proportions
print("Euclidean distances:")
print(euclidean_distances(docs))
# doc1-doc2: 6.48 (nonzero purely because of the magnitude difference)
# doc1-doc3: 10.05
# doc2-doc3: 15.07

# Cosine similarity: doc1 and doc2 are identical in direction!
print("\nCosine similarities:")
print(cosine_similarity(docs))
# doc1-doc2: 1.0 (identical direction!)
# doc1-doc3: 0.0 (orthogonal - no word overlap)
# doc2-doc3: 0.0 (orthogonal - no word overlap)

# After L2 normalization, Euclidean distance captures direction too
docs_normalized = normalize(docs, norm='l2')
print("\nEuclidean distances after L2 normalization:")
print(euclidean_distances(docs_normalized))
# Now doc1-doc2: 0.0 (identical after normalization)
```

Numerical distances like Euclidean don't apply to categorical data. You can't compute the "average" of {red, blue, green} or the "distance" between married and single. Specialized measures are required.
Simple Matching Coefficient:
For two categorical vectors $x, y$ with $d$ attributes:
$$d_{\text{simple}}(x, y) = \frac{\text{number of mismatches}}{d} = \frac{\sum_{i=1}^{d} \mathbb{1}[x_i \neq y_i]}{d}$$
This treats each attribute equally and counts disagreements.
Hamming Distance:
$$d_{\text{Hamming}}(x, y) = \sum_{i=1}^{d} \mathbb{1}[x_i \neq y_i]$$
The unnormalized count of differences. Originally defined for binary strings, but extends to any categorical data.
Jaccard Similarity (for Sets):
When features represent presence/absence (binary categorical), Jaccard measures the overlap:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Jaccard distance: $d_J(A, B) = 1 - J(A, B)$
This is particularly useful for:
- binary presence/absence features, such as which products a customer has purchased
- comparing documents or records represented as sets of tags, keywords, or shingles
- sparse binary data, because Jaccard ignores shared absences (0-0 matches) that would otherwise dominate
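A minimal sketch comparing simple matching, Hamming, and Jaccard on two small binary vectors (note that SciPy's `hamming` already divides by the vector length):

```python
import numpy as np
from scipy.spatial.distance import hamming, jaccard

# Binary "which items does this customer own?" vectors (illustrative)
a = np.array([1, 1, 0, 1, 0, 0, 0, 1])
b = np.array([1, 0, 0, 1, 0, 1, 0, 1])

d = len(a)
mismatches = np.sum(a != b)                      # positions where the vectors disagree

print("Hamming (raw count):     ", mismatches)            # 2
print("Simple matching distance:", mismatches / d)         # 2/8 = 0.25
print("scipy hamming:           ", hamming(a, b))          # 0.25 (normalized, equals simple matching)

# Jaccard ignores positions where BOTH vectors are 0 (shared absences)
intersection = np.sum((a == 1) & (b == 1))       # 3
union = np.sum((a == 1) | (b == 1))              # 5
print("Jaccard similarity:      ", intersection / union)   # 3/5 = 0.6
print("scipy jaccard distance:  ", jaccard(a.astype(bool), b.astype(bool)))  # 1 - 0.6 = 0.4
```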
Gower's Distance (Mixed Data):
Real-world data often mixes numerical and categorical features. Gower's distance handles this:
$$d_{\text{Gower}}(x, y) = \frac{\sum_{i=1}^{d} w_i \cdot d_i(x_i, y_i)}{\sum_{i=1}^{d} w_i}$$
where $d_i$ is an appropriate distance for feature $i$:
- Numerical features: the range-normalized absolute difference, $d_i(x_i, y_i) = |x_i - y_i| / R_i$, where $R_i$ is the range of feature $i$ in the data
- Categorical features: simple matching, $d_i(x_i, y_i) = 0$ if $x_i = y_i$ and $1$ otherwise
- Ordinal features: values are typically replaced by their (normalized) ranks and then treated numerically
$w_i$ are feature weights (default 1, set to 0 for missing values).
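Gower's distance is not built into scikit-learn, but the per-feature computation is simple enough to sketch directly; the feature values, range, and weights below are made up for illustration:

```python
# Two records with mixed features: (age in years, favorite color)
x = (34, "red")
y = (51, "blue")

age_range = 60.0          # assumed range of the age feature across the dataset
weights = [1.0, 1.0]      # per-feature weights (set to 0 to skip a missing value)

# Per-feature distances, each scaled to [0, 1]
d_age = abs(x[0] - y[0]) / age_range      # range-normalized numerical difference
d_color = 0.0 if x[1] == y[1] else 1.0    # simple matching for the categorical feature

gower = (weights[0] * d_age + weights[1] * d_color) / sum(weights)
print(f"d_age = {d_age:.3f}, d_color = {d_color:.1f}, Gower = {gower:.3f}")
# |34 - 51| / 60 = 0.283; colors differ -> 1.0; Gower = (0.283 + 1.0) / 2 = 0.642
```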
| Data Type | Recommended Measures | Notes |
|---|---|---|
| Binary | Jaccard, Dice, Simple Matching, Hamming | Consider asymmetric measures if 1s and 0s have different meanings |
| Nominal Categorical | Simple Matching, Hamming | All categories equally dissimilar |
| Ordinal | Rank-based, Spearman correlation | Order matters but intervals may not |
| Mixed | Gower's distance | Combines appropriate measures per feature |
| Counts/Frequencies | Chi-squared, Hellinger | Compare distributions |
A common approach is to one-hot encode categorical variables and use Euclidean distance. This works but has subtle issues: all categories become equidistant, which may not reflect domain knowledge (e.g., 'red' might be more similar to 'orange' than to 'blue'). Consider embedding-based approaches or custom distance matrices when category relationships matter.
Different domains have developed specialized distance measures tailored to their data characteristics. Understanding these expands your toolkit for real-world clustering problems.
Mahalanobis Distance:
Euclidean distance treats all features equally and assumes independence. Mahalanobis distance accounts for correlations and different variances:
$$d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$$
where $\Sigma$ is the covariance matrix of the data. This effectively:
- rescales each feature by its variance, so high-variance features don't dominate
- removes the effect of correlations by measuring distance in a "whitened" coordinate system
- reduces to standard Euclidean distance when $\Sigma$ is the identity matrix
When to use: When features have different scales or strong correlations; when you want distance relative to the data's natural variation.
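A minimal sketch of the computation with NumPy and SciPy on synthetic correlated data; the covariance is estimated from the sample, which is the usual practice:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis, euclidean

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated features with different variances
cov_true = np.array([[4.0, 3.0],
                     [3.0, 9.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=cov_true, size=1000)

# Estimate the covariance from the data and invert it
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

x = np.array([2.0, 3.0])   # lies roughly along the correlation direction
y = np.array([0.0, 0.0])

print("Euclidean:  ", euclidean(x, y))
print("Mahalanobis:", mahalanobis(x, y, cov_inv))
# Mahalanobis distance is smaller along directions where the data naturally
# varies a lot, and larger in directions of low variance.
```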
Correlation-Based Distance:
Pearson correlation measures linear relationship between vectors:
$$r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \cdot \sqrt{\sum_i (y_i - \bar{y})^2}}$$
Correlation distance: $d_r(x, y) = 1 - r(x, y)$
This measures whether vectors increase and decrease together, regardless of their absolute values or scales. Common in gene expression analysis where expression "profiles" matter.
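A minimal sketch: two synthetic "expression profiles" that rise and fall together on different scales, compared with a third that moves oppositely. SciPy's `correlation` returns 1 - r directly.

```python
import numpy as np
from scipy.spatial.distance import correlation   # returns 1 - Pearson r

# Synthetic expression profiles across 6 conditions
g1 = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
g2 = np.array([10.0, 20.0, 30.0, 20.0, 10.0, 0.0])   # same shape, 10x the scale
g3 = np.array([3.0, 2.0, 1.0, 2.0, 3.0, 4.0])        # moves in the opposite direction

print("d(g1, g2):", correlation(g1, g2))  # ~0.0 (perfectly correlated profiles)
print("d(g1, g3):", correlation(g1, g3))  # ~2.0 (perfectly anti-correlated)
```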
Domain-Specific Examples:
| Domain | Distance Measure | Why |
|---|---|---|
| Images | SSIM, Perceptual hash | Pixel-wise differences miss perceptual similarity |
| Graphs | Graph edit distance, Spectral | Structure matters more than labels |
| GPS/Geospatial | Haversine, Vincenty | Earth curvature matters for distant points |
| Molecular | Tanimoto coefficient | Molecular fingerprint similarity |
| Music | Tempo-invariant DTW | Same melody at different speeds should be similar |
One of the most important phenomena affecting distance-based methods is the curse of dimensionality. As the number of dimensions increases, distances behave in counterintuitive ways that can break clustering algorithms.
The Concentration of Distances:
Consider $n$ random points uniformly distributed in a $d$-dimensional unit hypercube. As $d$ increases:
$$\frac{d_{\max} - d_{\min}}{d_{\min}} \rightarrow 0 \quad \text{as } d \rightarrow \infty$$
This means that the relative difference between the nearest and farthest points vanishes. All points become approximately equidistant from each other! When everything is equally far apart, the concept of "nearest neighbor" becomes meaningless.
With 1000 features and random data, the closest point might be 95% as far as the farthest point. Your clustering algorithm's decisions are based on noise in the 5%, not meaningful structure. This isn't a software bug—it's a fundamental mathematical phenomenon.
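You can observe the concentration effect directly by simulation; this sketch measures the relative contrast $(d_{\max} - d_{\min}) / d_{\min}$ from one query point to random uniform data as the dimensionality grows (exact numbers will vary run to run):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of random points

for d in [2, 10, 100, 1000, 10_000]:
    X = rng.uniform(size=(n, d))    # points in the unit hypercube
    q = rng.uniform(size=d)         # a query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:6d}  min={dists.min():7.2f}  max={dists.max():7.2f}  contrast={contrast:7.3f}")

# Typical behavior: the relative contrast shrinks from order 1 in low dimensions
# toward a small fraction in high dimensions; nearest and farthest points
# become almost the same distance away.
```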
Why Does This Happen?
In high dimensions:
Volume concentrates at edges: In a hypercube, almost all volume is near the corners, not the center. A "shell" near the surface contains most points.
Many small contributions add up: Euclidean distance squares differences and sums them. With 1000 dimensions, even tiny per-dimension differences accumulate to large total distances.
Random vectors are nearly orthogonal: Random vectors in high dimensions are almost always perpendicular to each other, making angles uninformative.
Mitigation Strategies:
- Reduce dimensionality before clustering (PCA, random projections, autoencoders) or select only informative features.
- Prefer lower-order norms such as Manhattan distance, which tend to preserve more contrast than Euclidean in high dimensions.
- For sparse data, use measures designed for it (cosine similarity, Jaccard).
- Consider subspace or spectral clustering methods that do not rely on distances in the full feature space.
A Quantitative Example:
For $n$ points uniformly distributed in $[0, 1]^d$, the expected distance to the nearest neighbor scales as:
$$E[d_{\text{nearest}}] \propto n^{-1/d}$$
As $d$ increases, $n^{-1/d}$ approaches 1 for any fixed $n$: the nearest neighbor ends up roughly as far away as a typical random point in the space. To maintain the same expected nearest-neighbor distance as $d$ grows, you need exponentially more data points (roughly, doubling $d$ squares the required $n$).
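A small simulation illustrates the same effect: with $n$ fixed, the average nearest-neighbor distance creeps up toward the average distance between arbitrary pairs as $d$ increases (a minimal sketch; $n$ is kept small so the pairwise computation stays cheap):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
n = 500  # fixed sample size

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))
    D = cdist(X, X)                      # all pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)          # ignore self-distances
    nearest = D.min(axis=1).mean()       # average nearest-neighbor distance
    typical = D[np.isfinite(D)].mean()   # average distance between random pairs
    print(f"d={d:5d}  mean NN dist={nearest:6.2f}  mean pairwise={typical:6.2f}  ratio={nearest/typical:.2f}")

# The ratio climbs toward 1: the "nearest" neighbor is barely nearer
# than a randomly chosen point.
```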
Given the many options, how do you choose the right distance measure? This is where art meets science, and domain knowledge becomes crucial.
Decision Framework:
| Question | If Yes... | If No... |
|---|---|---|
| Is magnitude meaningful? | Euclidean or Manhattan | Cosine similarity |
| Are features on same scale? | Can use raw distances | Standardize first or use Mahalanobis |
| Are features correlated? | Consider Mahalanobis | Euclidean is fine |
| Is data categorical? | Simple matching, Jaccard | Use numerical distances |
| Is data mixed type? | Gower's distance | Choose based on dominant type |
| Is data high-dimensional? | Consider L1, dimensionality reduction | Euclidean works well |
| Is data sparse? | Cosine, Jaccard for binary | Dense-data measures OK |
Practical Guidelines:
Start with domain knowledge. What does "similarity" mean in your domain? Two customers are similar if they buy similar products—that suggests different features than if they're similar because they shop at similar times.
Standardize numerical features. Unless you have a good reason not to, standardize to zero mean and unit variance. Otherwise, the feature with the largest scale dominates.
Visualize distances. Before clustering, examine the distance distribution. Are there natural gaps? Do distances make intuitive sense for known-similar pairs?
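For that visualization step, here is a minimal sketch (synthetic stand-in data; swap in your own matrix and metric) that computes all pairwise distances with SciPy and prints a coarse text histogram:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Stand-in data: two loose blobs, so the distance distribution should be bimodal
X = np.vstack([rng.normal(0, 1, size=(100, 4)),
               rng.normal(5, 1, size=(100, 4))])

dists = pdist(X, metric="euclidean")     # condensed vector of all pairwise distances
counts, edges = np.histogram(dists, bins=15)

for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * int(50 * count / counts.max())
    print(f"{lo:6.2f} - {hi:6.2f} | {bar}")
# A clear gap between the "within-blob" and "between-blob" distance ranges
# is a good sign that distance-based clustering will find structure.
```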
Test multiple measures. Run your clustering with different distance measures and compare results. If they're wildly different, understand why.
Validate with domain experts. Show example point pairs and their computed distances. Do the numbers match expert intuition about similarity?
The right distance measure is the one that makes your clusters useful for your purpose. If your customer clusters don't predict purchasing behavior, the distance measure is wrong for your goal—regardless of its theoretical properties. Always validate against downstream objectives.
We've built a comprehensive understanding of distance and similarity measures. Let's consolidate the key insights:
- Distance and similarity are inversely related, and the transformation you choose between them affects clustering results.
- A true metric satisfies non-negativity, identity of indiscernibles, symmetry, and the triangle inequality; many useful measures relax one or more of these, at a cost in guarantees and efficiency.
- The Minkowski family (Manhattan, Euclidean, Chebyshev) covers most numerical data, while cosine similarity is the tool of choice when only direction matters.
- Categorical and mixed data call for specialized measures such as Hamming, Jaccard, and Gower's distance.
- In high dimensions, distances concentrate and lose contrast, so reduce dimensionality or choose measures accordingly.
- The right measure is ultimately the one that produces clusters useful for your goal, validated with domain knowledge.
What's Next:
With distance measures understood, we can now precisely define what a "cluster" actually is. The next page explores the various formal definitions of clusters—from intuitive notions to rigorous mathematical formulations. We'll see how different definitions lead to different clustering algorithms.
You now have a solid foundation in distance and similarity measures for clustering. You understand the mathematical properties, know the major distance families, and can reason about which measures suit different data types and domains. Next, we'll formalize what we mean by a 'cluster.'