t-SNE produces some of the most visually compelling dimensionality reduction plots in machine learning. This visual appeal is both a strength and a danger—it's easy to generate a beautiful plot that is fundamentally misleading or suboptimally computed.
After covering the theory, perplexity, optimization, and interpretation of t-SNE, this final page consolidates the practical pitfalls that trip up even experienced practitioners. We'll catalog common mistakes across four categories: parameter choices, data preprocessing, code-level errors, and interpretation failures.
Think of this page as a pre-flight checklist for t-SNE. Before presenting any t-SNE result, run through these pitfalls to ensure you haven't fallen into any of them.
By the end of this page, you will:

- Recognize and avoid parameter-related pitfalls
- Understand preprocessing requirements for t-SNE
- Identify computational and implementation mistakes
- Have a comprehensive checklist for t-SNE usage
- Know when t-SNE is NOT the right tool
The most common errors involve misunderstanding or misusing t-SNE's hyperparameters.
Pitfall 1.1: Perplexity Too High or Too Low
| Error | Problem | Symptoms | Solution |
|---|---|---|---|
| Perplexity ≥ N/3 | Too high | Uniform, blob-like embedding; no clear structure | Reduce perplexity; ensure perplexity << N |
| Perplexity < 5 | Too low for most data | Fragmented clusters; noise appears as structure | Increase perplexity to 15-30 |
| Fixed perplexity for all datasets | Inflexible | Inconsistent results across different data | Adjust perplexity based on dataset size |
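As a rough starting point, the adjustment in the last row can be turned into a rule-of-thumb helper. The heuristic below (start near √N, clamp to the conventional 5–50 band, stay well below N/3) is our own assumption, not part of scikit-learn or the t-SNE paper:

```python
def suggest_perplexity(N):
    """Rule-of-thumb starting perplexity for a dataset of N points.

    Heuristic (an assumption, not a published recommendation): start
    near sqrt(N), clamp to the commonly used 5-50 range, and keep the
    value well below N/3 so neighborhoods stay meaningful.
    """
    if N < 10:
        raise ValueError("t-SNE needs more than a handful of points")
    guess = int(N ** 0.5)            # grows slowly with dataset size
    guess = max(5, min(guess, 50))   # clamp to the conventional 5-50 band
    return min(guess, max(2, (N - 1) // 3))  # enforce perplexity << N

print(suggest_perplexity(100))    # modest dataset
print(suggest_perplexity(10000))  # large dataset -> capped at 50
```

Treat the returned value as a first guess to be refined by the validation and multi-perplexity runs discussed elsewhere on this page.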
```python
def validate_perplexity(N, perplexity):
    """
    Check if perplexity is appropriate for dataset size.
    """
    issues = []
    if perplexity >= N / 3:
        issues.append(f"Perplexity ({perplexity}) >= N/3 ({N/3:.0f}). "
                      f"This is too high; results will be meaningless.")
    if perplexity >= N:
        issues.append(f"CRITICAL: Perplexity ({perplexity}) >= N ({N}). "
                      f"This will fail or produce garbage.")
    if perplexity < 5:
        issues.append(f"Perplexity ({perplexity}) < 5. "
                      f"May produce fragmented, noisy embeddings.")
    if N < 100 and perplexity > 30:
        issues.append(f"Small dataset (N={N}) with high perplexity ({perplexity}). "
                      f"Consider reducing to {min(15, N//5)}.")

    if not issues:
        print(f"✓ Perplexity {perplexity} looks appropriate for N={N}")
    else:
        for issue in issues:
            print(f"⚠ {issue}")
    return issues

# Example usage
validate_perplexity(N=1000, perplexity=30)  # OK
validate_perplexity(N=50, perplexity=30)    # Warning
validate_perplexity(N=100, perplexity=40)   # Warning
```

Pitfall 1.2: Insufficient Iterations
A very common mistake: not running enough iterations for the optimization to converge.
Some early tutorials and default settings used n_iter=250. This is far too few! The early exaggeration phase alone typically runs for 250 iterations. Running only 250 total iterations means almost no post-exaggeration optimization. Always use at least n_iter=1000 for reasonable results.
Signs of Insufficient Iterations: points still compressed into a dense central blob, clusters that have not fully separated from one another, and a KL divergence that is still decreasing noticeably when optimization stops.
Fix: Increase n_iter to 1000+ and monitor KL divergence to ensure convergence.
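A minimal sketch of what "monitor KL divergence" can mean in practice: record the KL values that verbose output prints every few hundred iterations and check whether they have plateaued. The function name and thresholds below are illustrative assumptions, not part of scikit-learn:

```python
def kl_has_converged(kl_trace, window=3, rel_tol=1e-3):
    """Crude convergence check on a list of KL divergence values
    recorded periodically during optimization (e.g. copied from
    verbose output). Returns True if the relative improvement over
    the last `window` recordings is below rel_tol.

    Both `window` and `rel_tol` are illustrative defaults, not
    universally correct thresholds.
    """
    if len(kl_trace) <= window:
        return False  # not enough history to judge
    old, new = kl_trace[-window - 1], kl_trace[-1]
    return (old - new) / max(abs(old), 1e-12) < rel_tol

still_improving = [5.0, 2.1, 1.4, 1.1, 0.9]            # keep iterating
plateaued = [5.0, 1.2, 1.0001, 1.0001, 1.0, 1.0]       # looks converged
print(kl_has_converged(still_improving))
print(kl_has_converged(plateaued))
```

If the trace still shows steady improvement at the end of a run, rerun with a larger n_iter rather than trusting the embedding.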
Pitfall 1.3: Default Learning Rate with Large Datasets
The older default learning rate of 200 can be suboptimal for larger datasets.
```python
from sklearn.manifold import TSNE

# INCORRECT: Fixed learning rate for large dataset
# This may cause slow convergence or instability
tsne_bad = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200.0,  # Fixed value - may be inappropriate
    n_iter=1000
)

# CORRECT: Use automatic learning rate (sklearn 1.2+)
tsne_good = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate='auto',  # Automatically scales with N
    n_iter=1000
)

# Alternative: Manual scaling for older versions
N = len(X)
learning_rate = max(N / 12, 50)  # Heuristic that scales with data size
```

Data preprocessing dramatically affects t-SNE results. Many practitioners underestimate its importance.
Pitfall 2.1: Unscaled Features
If features have vastly different scales, distances are dominated by large-scale features.
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.manifold import TSNE
import numpy as np

def preprocess_for_tsne(X, method='standard', handle_outliers=True):
    """
    Properly preprocess data for t-SNE.

    Args:
        X: Input data (N, D)
        method: 'standard' (z-score) or 'minmax' ([0,1])
        handle_outliers: Clip extreme values before scaling

    Returns:
        X_preprocessed: Preprocessed data
    """
    X = X.copy()

    # Step 1: Handle missing values
    if np.isnan(X).any():
        print("Warning: Data contains NaN. Imputing with column means.")
        col_means = np.nanmean(X, axis=0)
        inds = np.where(np.isnan(X))
        X[inds] = np.take(col_means, inds[1])

    # Step 2: Handle outliers (optional but recommended)
    if handle_outliers:
        for col in range(X.shape[1]):
            q01 = np.percentile(X[:, col], 1)
            q99 = np.percentile(X[:, col], 99)
            X[:, col] = np.clip(X[:, col], q01, q99)

    # Step 3: Scale features
    if method == 'standard':
        scaler = StandardScaler()
    elif method == 'minmax':
        scaler = MinMaxScaler()
    else:
        raise ValueError(f"Unknown method: {method}")

    X_scaled = scaler.fit_transform(X)

    print(f"Preprocessed: {X.shape[0]} samples, {X.shape[1]} features")
    print(f"Scaling method: {method}")
    print(f"Feature ranges after scaling: "
          f"[{X_scaled.min():.2f}, {X_scaled.max():.2f}]")

    return X_scaled

# Usage
X_preprocessed = preprocess_for_tsne(X_raw, method='standard')
tsne = TSNE(n_components=2, perplexity=30)
Y = tsne.fit_transform(X_preprocessed)
```

Pitfall 2.2: Ignoring High Dimensionality
t-SNE on very high-dimensional data (D > 50) can be slow and may give poor results due to the curse of dimensionality.
For D > 50 features, consider reducing to 50 dimensions with PCA before running t-SNE. This speeds up computation (pairwise distances cost O(N²·D), so shrinking D directly reduces the work) and can improve results by removing noisy dimensions. The original t-SNE paper recommends this approach.
```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def create_tsne_pipeline(target_pca_dims=50, perplexity=30):
    """
    Create a preprocessing + t-SNE pipeline.
    Includes scaling, PCA reduction, and t-SNE.
    """
    return Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=target_pca_dims)),
        ('tsne', TSNE(
            n_components=2,
            perplexity=perplexity,
            n_iter=1000,
            learning_rate='auto',
            init='pca',
            random_state=42
        ))
    ])

# For very high-dimensional data (e.g., D=10000)
X_high_dim = load_high_dim_data()  # Shape: (N, 10000)

pipeline = create_tsne_pipeline(target_pca_dims=50, perplexity=30)
Y = pipeline.fit_transform(X_high_dim)

# This is MUCH faster than t-SNE on 10000 dims
# Also often gives BETTER results (less noise)
```

Pitfall 2.3: Duplicate Points
Exact duplicate points can cause numerical issues and misleading visualizations.
```python
import numpy as np

def remove_duplicates(X, labels=None, return_counts=False):
    """
    Remove duplicate points before t-SNE.

    Returns:
        X_unique: Deduplicated data
        indices: Original indices of unique points
        labels_unique: Labels of the unique points (None if no labels)
        counts: (optional) How many times each unique point occurred
    """
    # Find unique rows; `counts` is aligned with the unique rows,
    # in the same order as `unique_indices`
    _, unique_indices, counts = np.unique(
        X, axis=0, return_index=True, return_counts=True
    )

    n_duplicates = len(X) - len(unique_indices)
    if n_duplicates > 0:
        print(f"Removed {n_duplicates} duplicate points "
              f"({100*n_duplicates/len(X):.1f}%)")

    X_unique = X[unique_indices]

    # Handle labels if provided
    labels_unique = labels[unique_indices] if labels is not None else None

    if return_counts:
        return X_unique, unique_indices, labels_unique, counts
    return X_unique, unique_indices, labels_unique

# Usage
X_clean, indices, labels_clean = remove_duplicates(X_raw, labels_raw)
Y = tsne.fit_transform(X_clean)
```

Even with correct parameters and preprocessing, computational issues can derail t-SNE.
Pitfall 3.1: Not Setting Random Seed
Without a fixed random seed, results are not reproducible.
```python
from sklearn.manifold import TSNE

# WRONG: No random state - results change every run
tsne_unreproducible = TSNE(n_components=2, perplexity=30)
Y1 = tsne_unreproducible.fit_transform(X)  # Result A
Y2 = tsne_unreproducible.fit_transform(X)  # Result B (different!)

# CORRECT: Set random_state for reproducibility
tsne_reproducible = TSNE(n_components=2, perplexity=30, random_state=42)
Y1 = tsne_reproducible.fit_transform(X)  # Result A
Y2 = tsne_reproducible.fit_transform(X)  # Result A (same!)

# BEST PRACTICE: Run with multiple seeds to assess stability
seeds = [42, 123, 456, 789, 101112]
embeddings = []
for seed in seeds:
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed)
    Y = tsne.fit_transform(X)
    embeddings.append(Y)
# Compare embeddings to identify stable vs. unstable structures
```

Pitfall 3.2: Wrong Algorithm for Dataset Size
Using exact t-SNE on large datasets, or Barnes-Hut on very small datasets.
| Dataset Size | Recommended Method | Time Complexity | Notes |
|---|---|---|---|
| N < 500 | method='exact' or 'barnes_hut' | O(N²) | Either works; exact may be more accurate |
| 500 ≤ N < 50,000 | method='barnes_hut' | O(N log N) | Default in sklearn; good balance |
| N ≥ 50,000 | Use openTSNE or FIt-SNE | O(N) | sklearn may be slow; specialized implementations needed |
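The table above can be captured in a small dispatch helper. The function name and return strings are our own; the size boundaries (500 and 50,000) come directly from the table:

```python
def pick_tsne_method(n_samples):
    """Map dataset size to the method recommendations in the table.
    Boundaries follow the table; this helper itself is illustrative.
    """
    if n_samples < 500:
        return "exact"         # O(N^2) is affordable; slightly more accurate
    if n_samples < 50_000:
        return "barnes_hut"    # sklearn default, O(N log N)
    return "openTSNE/FIt-SNE"  # sklearn becomes slow at this scale

print(pick_tsne_method(200))
print(pick_tsne_method(10_000))
print(pick_tsne_method(100_000))
```

The first two return values can be passed straight to sklearn's `TSNE(method=...)`; the third is a pointer to specialized libraries rather than an sklearn option.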
Pitfall 3.3: Memory Errors with Large Datasets
t-SNE requires O(N²) memory for the probability matrix in naive implementations.
For N=50,000 points: The full probability matrix would require 50,000² × 8 bytes = 20 GB! Modern implementations use sparse matrices (only storing k nearest neighbors), but memory can still be an issue. For very large N, consider subsampling or using out-of-core implementations.
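The arithmetic above generalizes to a quick estimator. The sparse figure assumes ~3×perplexity stored neighbors per point, a common choice in Barnes-Hut-style implementations, and 8-byte floats throughout; both are stated assumptions, not guarantees about any particular library:

```python
def tsne_memory_gb(n_samples, perplexity=30, bytes_per_value=8):
    """Back-of-the-envelope memory estimates for the probability matrix.

    dense:  full N x N matrix (naive implementations)
    sparse: ~3*perplexity neighbors per point (Barnes-Hut-style storage)
    Both assume `bytes_per_value`-byte floats; index overhead is ignored.
    """
    dense = n_samples ** 2 * bytes_per_value
    sparse = n_samples * 3 * perplexity * bytes_per_value
    return dense / 1e9, sparse / 1e9

dense_gb, sparse_gb = tsne_memory_gb(50_000)
print(f"dense:  {dense_gb:.1f} GB")   # reproduces the 20 GB figure above
print(f"sparse: {sparse_gb:.3f} GB")  # a few tens of MB
```

Running this before a large job tells you quickly whether you need subsampling or a sparse, approximate implementation.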
```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

def tsne_with_subsample(X, labels=None, max_samples=10000,
                        perplexity=30, random_state=42):
    """
    Apply t-SNE with optional subsampling for very large datasets.
    For datasets larger than max_samples, randomly subsample.
    """
    N = X.shape[0]

    if N <= max_samples:
        # Dataset is small enough - use all data
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    random_state=random_state, n_iter=1000)
        Y = tsne.fit_transform(X)
        return Y, np.arange(N)

    # Subsample
    if labels is not None:
        # Stratified sampling to preserve class proportions:
        # split the index array once so data, labels, and indices stay aligned
        indices, _ = train_test_split(
            np.arange(N), train_size=max_samples,
            stratify=labels, random_state=random_state
        )
    else:
        # Random sampling
        rng = np.random.default_rng(random_state)
        indices = rng.choice(N, max_samples, replace=False)

    X_sample = X[indices]
    print(f"Subsampled from {N} to {max_samples} points")

    tsne = TSNE(n_components=2, perplexity=perplexity,
                random_state=random_state, n_iter=1000)
    Y = tsne.fit_transform(X_sample)
    return Y, indices

# Usage for large dataset
Y_sample, sample_indices = tsne_with_subsample(
    X_large, labels_large, max_samples=20000
)
```

Pitfall 3.4: Ignoring Convergence Warnings
sklearn may emit warnings about convergence—don't ignore them!
We covered interpretation guidelines in detail on the previous page, but let's consolidate the most dangerous interpretation errors here.
Pitfall 4.1: Concluding Similarity from Distance
"Cluster A is close to Cluster B and far from Cluster C in the t-SNE plot, so A is more similar to B than to C."
This conclusion is INVALID. t-SNE does not preserve inter-cluster distances. The relative positions of clusters in the embedding are not meaningful. To make similarity claims, you must analyze the original feature space.
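A minimal sketch of doing this right: compare cluster centroids in the original feature space, not in the embedding. The toy clusters below are synthetic and purely illustrative; for real data, substitute your own feature matrix and cluster assignments:

```python
import math
import random

random.seed(0)

def make_cluster(cx, cy, n=50, spread=0.1):
    """Synthetic 2-D cluster around (cx, cy) -- stand-in for real data."""
    return [(cx + random.gauss(0, spread), cy + random.gauss(0, spread))
            for _ in range(n)]

A = make_cluster(0, 0)
B = make_cluster(1, 0)
C = make_cluster(10, 0)

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def centroid_distance(p1, p2):
    """Distance between cluster means in the ORIGINAL feature space --
    the space where similarity claims must actually be checked."""
    return math.dist(centroid(p1), centroid(p2))

d_ab = centroid_distance(A, B)
d_ac = centroid_distance(A, C)
print(f"A-B: {d_ab:.2f}, A-C: {d_ac:.2f}")  # here A really is closer to B
```

A t-SNE plot of these three clusters could legally place C next to A; only the original-space distances above license the claim that A is more similar to B.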
Pitfall 4.2: Interpreting Cluster Size as Variance
"Cluster A appears larger than Cluster B, so A must have higher variance or be more diverse."
This conclusion is INVALID. Cluster sizes in t-SNE are artifacts of the algorithm, not reflections of data properties. A tight cluster in original space may appear large in t-SNE due to the heavy-tailed Student-t distribution.
Pitfall 4.3: Counting Clusters in t-SNE
"The t-SNE plot shows 7 distinct clusters, so our data has 7 natural groups."
This is RISKY. The number of visible clusters depends heavily on perplexity. Low perplexity artificially fragments continuous structures; high perplexity may merge distinct groups. Always validate cluster counts with multiple methods and perplexity values.
Pitfall 4.4: Single-Run Conclusions
Drawing conclusions from a single t-SNE run is unreliable: different random initializations can produce visibly different embeddings of the same data. Solution: Always run with multiple seeds and perplexity values. Only claim structure that is consistent across runs.
Pitfall 4.5: t-SNE for Non-Visualization Tasks
t-SNE is designed for VISUALIZATION, not as a general-purpose dimensionality reduction. Using t-SNE embeddings as features for downstream ML tasks (clustering, classification) is usually a bad idea:
- Embeddings are not stable across runs
- Distances are not meaningful for ML algorithms
- New points cannot be projected (no transform method)
- Better alternatives exist for feature extraction
Use PCA, UMAP, or autoencoders for feature extraction; use t-SNE only for visualization.
t-SNE is not the right tool for every dimensionality reduction task. Understanding its limitations helps you choose the right method.
t-SNE is NOT suitable when:
| Scenario | Why t-SNE Fails | Better Alternative |
|---|---|---|
| You need to project new points | No out-of-sample extension | UMAP, PCA, parametric t-SNE |
| You need meaningful distances | Distances are not preserved | MDS, UMAP, PCA |
| You need rotation-invariant features | Embeddings are rotation-arbitrary | PCA, autoencoders |
| Dataset has N > 100,000 | Slow even with approximations | UMAP, FIt-SNE, sampling |
| You need reproducible ML features | Results vary across runs | PCA, UMAP with fixed seed |
| Global structure matters | Only preserves local neighborhoods | MDS, UMAP, Isomap |
t-SNE IS suitable when:

- The goal is visualization and exploratory analysis, not feature extraction
- Local neighborhood structure is what you want to reveal
- The dataset is of moderate size (roughly N < 100,000)
- You treat the plot as hypothesis-generating, to be validated by other methods
UMAP (Uniform Manifold Approximation and Projection) addresses many t-SNE limitations: faster, supports out-of-sample projection, better preserves global structure, GPU-accelerated implementations available. For many use cases, UMAP is now preferred. Consider UMAP as your first choice for new projects, with t-SNE as a complementary method.
Before presenting any t-SNE result, run through this comprehensive checklist.
Pre-Computation Checklist:

- Features scaled (StandardScaler or MinMaxScaler)
- NaN/Inf values handled; exact duplicates removed
- Perplexity well below N/3 (typically 5-50)
- PCA reduction to ~50 dimensions if D > 50
- n_iter ≥ 1000; learning_rate='auto' (or scaled with N)
- random_state fixed, with a plan to re-run under multiple seeds
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

def robust_tsne(X, labels=None, perplexity=30, random_state=42):
    """
    Best-practice t-SNE implementation.
    Includes all recommended preprocessing and parameter settings.
    """
    # 1. Validate input
    assert not np.isnan(X).any(), "Data contains NaN values"
    assert not np.isinf(X).any(), "Data contains Inf values"

    N, D = X.shape
    print(f"Input: {N} samples, {D} features")

    # 2. Remove duplicates
    X_unique, unique_idx = np.unique(X, axis=0, return_index=True)
    if len(X_unique) < N:
        print(f"Removed {N - len(X_unique)} duplicate points")
        X = X_unique
        labels = labels[unique_idx] if labels is not None else None
        N = len(X)

    # 3. Validate perplexity
    assert perplexity < N / 3, f"Perplexity {perplexity} too high for N={N}"

    # 4. Scale features
    X_scaled = StandardScaler().fit_transform(X)

    # 5. Optional PCA for high-D data
    if D > 50:
        pca_dims = min(50, N - 1)
        X_scaled = PCA(n_components=pca_dims).fit_transform(X_scaled)
        print(f"Reduced from {D} to {pca_dims} dims with PCA")

    # 6. Run t-SNE with best practices
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        n_iter=1000,
        learning_rate='auto',
        init='pca',
        random_state=random_state,
        verbose=1
    )
    Y = tsne.fit_transform(X_scaled)

    print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")
    return Y, labels

# Usage
Y, labels = robust_tsne(X_raw, labels_raw, perplexity=30, random_state=42)
```

We've cataloged the most common mistakes in t-SNE usage across parameters, preprocessing, computation, and interpretation. Let's consolidate the key lessons.
Final Module Summary:
Over five pages, we've developed complete mastery of t-SNE:
With this knowledge, you can use t-SNE as a powerful, controlled visualization tool rather than a mysterious black box. Remember: t-SNE is for revealing local structure and generating hypotheses—not for proving claims about your data. Used correctly, it remains one of the most valuable tools in the machine learning practitioner's toolkit.
Congratulations! You have mastered t-SNE—from its elegant probabilistic foundations to its practical pitfalls. You can now use t-SNE with confidence, interpret its results correctly, and avoid the common mistakes that trip up even experienced practitioners. In the next module, we'll explore UMAP, a modern alternative that addresses many of t-SNE's limitations while preserving its visualization power.