t-SNE produces some of the most visually compelling dimensionality reduction plots in machine learning. This visual appeal is both a strength and a danger—it's easy to generate a beautiful plot that is fundamentally misleading or suboptimally computed.
After covering the theory, perplexity, optimization, and interpretation of t-SNE, this final page consolidates the practical pitfalls that trip up even experienced practitioners. We'll catalog common mistakes across four categories: parameter choices, data preprocessing, code-level errors, and interpretation failures.
Think of this page as a pre-flight checklist for t-SNE. Before presenting any t-SNE result, run through these pitfalls to ensure you haven't fallen into any of them.
By the end of this page, you will:

- Recognize and avoid parameter-related pitfalls
- Understand preprocessing requirements for t-SNE
- Identify computational and implementation mistakes
- Have a comprehensive checklist for t-SNE usage
- Know when t-SNE is NOT the right tool
The most common errors involve misunderstanding or misusing t-SNE's hyperparameters.
Pitfall 1.1: Perplexity Too High or Too Low
| Error | Problem | Symptoms | Solution |
|---|---|---|---|
| Perplexity ≥ N/3 | Too high | Uniform, blob-like embedding; no clear structure | Reduce perplexity; ensure perplexity << N |
| Perplexity < 5 | Too low for most data | Fragmented clusters; noise appears as structure | Increase perplexity to 15-30 |
| Fixed perplexity for all datasets | Inflexible | Inconsistent results across different data | Adjust perplexity based on dataset size |
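As a rough starting point, the adjustment in the last row can be turned into a rule-of-thumb helper. The heuristic below (start near √N, clamp to the conventional 5–50 band, stay well below N/3) is our own assumption, not part of scikit-learn or the t-SNE paper:

```python
def suggest_perplexity(N):
    """Rule-of-thumb starting perplexity for a dataset of N points.

    Heuristic (an assumption, not a published recommendation): start
    near sqrt(N), clamp to the commonly used 5-50 range, and keep the
    value well below N/3 so neighborhoods stay meaningful.
    """
    if N < 10:
        raise ValueError("t-SNE needs more than a handful of points")
    guess = int(N ** 0.5)            # grows slowly with dataset size
    guess = max(5, min(guess, 50))   # clamp to the conventional 5-50 band
    return min(guess, max(2, (N - 1) // 3))  # enforce perplexity << N

print(suggest_perplexity(100))    # modest dataset
print(suggest_perplexity(10000))  # large dataset -> capped at 50
```

Treat the returned value as a first guess to be refined by the validation and multi-perplexity runs discussed elsewhere on this page.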
```python
def validate_perplexity(N, perplexity):
    """
    Check if perplexity is appropriate for dataset size.
    """
    issues = []
    if perplexity >= N / 3:
        issues.append(f"Perplexity ({perplexity}) >= N/3 ({N/3:.0f}). "
                      f"This is too high; results will be meaningless.")
    if perplexity >= N:
        issues.append(f"CRITICAL: Perplexity ({perplexity}) >= N ({N}). "
                      f"This will fail or produce garbage.")
    if perplexity < 5:
        issues.append(f"Perplexity ({perplexity}) < 5. "
                      f"May produce fragmented, noisy embeddings.")
    if N < 100 and perplexity > 30:
        issues.append(f"Small dataset (N={N}) with high perplexity ({perplexity}). "
                      f"Consider reducing to {min(15, N//5)}.")

    if not issues:
        print(f"✓ Perplexity {perplexity} looks appropriate for N={N}")
    else:
        for issue in issues:
            print(f"⚠ {issue}")
    return issues

# Example usage
validate_perplexity(N=1000, perplexity=30)  # OK
validate_perplexity(N=50, perplexity=30)    # Warning
validate_perplexity(N=100, perplexity=40)   # Warning
```

Pitfall 1.2: Insufficient Iterations
A very common mistake: not running enough iterations for the optimization to converge.
Some early tutorials and default settings used n_iter=250. This is far too few! The early exaggeration phase alone typically runs for 250 iterations. Running only 250 total iterations means almost no post-exaggeration optimization. Always use at least n_iter=1000 for reasonable results.
Signs of Insufficient Iterations: points still compressed into a dense central blob, clusters that have not fully separated from one another, and a KL divergence that is still decreasing noticeably when optimization stops.
Fix: Increase n_iter to 1000+ and monitor KL divergence to ensure convergence.
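A minimal sketch of what "monitor KL divergence" can mean in practice: record the KL values that verbose output prints every few hundred iterations and check whether they have plateaued. The function name and thresholds below are illustrative assumptions, not part of scikit-learn:

```python
def kl_has_converged(kl_trace, window=3, rel_tol=1e-3):
    """Crude convergence check on a list of KL divergence values
    recorded periodically during optimization (e.g. copied from
    verbose output). Returns True if the relative improvement over
    the last `window` recordings is below rel_tol.

    Both `window` and `rel_tol` are illustrative defaults, not
    universally correct thresholds.
    """
    if len(kl_trace) <= window:
        return False  # not enough history to judge
    old, new = kl_trace[-window - 1], kl_trace[-1]
    return (old - new) / max(abs(old), 1e-12) < rel_tol

still_improving = [5.0, 2.1, 1.4, 1.1, 0.9]            # keep iterating
plateaued = [5.0, 1.2, 1.0001, 1.0001, 1.0, 1.0]       # looks converged
print(kl_has_converged(still_improving))
print(kl_has_converged(plateaued))
```

If the trace still shows steady improvement at the end of a run, rerun with a larger n_iter rather than trusting the embedding.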
Pitfall 1.3: Default Learning Rate with Large Datasets
The older default learning rate of 200 can be suboptimal for larger datasets.
```python
from sklearn.manifold import TSNE

# INCORRECT: Fixed learning rate for large dataset
# This may cause slow convergence or instability
tsne_bad = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200.0,  # Fixed value - may be inappropriate
    n_iter=1000
)

# CORRECT: Use automatic learning rate (sklearn 1.2+)
tsne_good = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate='auto',  # Automatically scales with N
    n_iter=1000
)

# Alternative: Manual scaling for older versions
N = len(X)
learning_rate = max(N / 12, 50)  # Heuristic that scales with data size
```

Data preprocessing dramatically affects t-SNE results. Many practitioners underestimate its importance.
Pitfall 2.1: Unscaled Features
If features have vastly different scales, distances are dominated by large-scale features.
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.manifold import TSNE
import numpy as np

def preprocess_for_tsne(X, method='standard', handle_outliers=True):
    """
    Properly preprocess data for t-SNE.

    Args:
        X: Input data (N, D)
        method: 'standard' (z-score) or 'minmax' ([0,1])
        handle_outliers: Clip extreme values before scaling

    Returns:
        X_preprocessed: Preprocessed data
    """
    X = X.copy()

    # Step 1: Handle missing values
    if np.isnan(X).any():
        print("Warning: Data contains NaN. Imputing with column means.")
        col_means = np.nanmean(X, axis=0)
        inds = np.where(np.isnan(X))
        X[inds] = np.take(col_means, inds[1])

    # Step 2: Handle outliers (optional but recommended)
    if handle_outliers:
        for col in range(X.shape[1]):
            q01 = np.percentile(X[:, col], 1)
            q99 = np.percentile(X[:, col], 99)
            X[:, col] = np.clip(X[:, col], q01, q99)

    # Step 3: Scale features
    if method == 'standard':
        scaler = StandardScaler()
    elif method == 'minmax':
        scaler = MinMaxScaler()
    else:
        raise ValueError(f"Unknown method: {method}")

    X_scaled = scaler.fit_transform(X)

    print(f"Preprocessed: {X.shape[0]} samples, {X.shape[1]} features")
    print(f"Scaling method: {method}")
    print(f"Feature ranges after scaling: "
          f"[{X_scaled.min():.2f}, {X_scaled.max():.2f}]")

    return X_scaled

# Usage
X_preprocessed = preprocess_for_tsne(X_raw, method='standard')
tsne = TSNE(n_components=2, perplexity=30)
Y = tsne.fit_transform(X_preprocessed)
```

Pitfall 2.2: Ignoring High Dimensionality
t-SNE on very high-dimensional data (D > 50) can be slow and may give poor results due to the curse of dimensionality.
For D > 50 features, consider reducing to 50 dimensions with PCA before running t-SNE. This speeds up computation (pairwise distances cost O(N²·D), so shrinking D directly reduces the work) and can improve results by removing noisy dimensions. The original t-SNE paper recommends this approach.
```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def create_tsne_pipeline(target_pca_dims=50, perplexity=30):
    """
    Create a preprocessing + t-SNE pipeline.
    Includes scaling, PCA reduction, and t-SNE.
    """
    return Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=target_pca_dims)),
        ('tsne', TSNE(
            n_components=2,
            perplexity=perplexity,
            n_iter=1000,
            learning_rate='auto',
            init='pca',
            random_state=42
        ))
    ])

# For very high-dimensional data (e.g., D=10000)
X_high_dim = load_high_dim_data()  # Shape: (N, 10000)

pipeline = create_tsne_pipeline(target_pca_dims=50, perplexity=30)
Y = pipeline.fit_transform(X_high_dim)

# This is MUCH faster than t-SNE on 10000 dims
# Also often gives BETTER results (less noise)
```

Pitfall 2.3: Duplicate Points
Exact duplicate points can cause numerical issues and misleading visualizations.
```python
import numpy as np

def remove_duplicates(X, labels=None, return_counts=False):
    """
    Remove duplicate points before t-SNE.

    Returns:
        X_unique: Deduplicated data
        indices: Original indices of unique points
        labels_unique: Labels of the unique points (None if no labels)
        counts: (optional) How many times each unique point occurred
    """
    # Find unique rows; `counts` is aligned with the unique rows,
    # in the same order as `unique_indices`
    _, unique_indices, counts = np.unique(
        X, axis=0, return_index=True, return_counts=True
    )

    n_duplicates = len(X) - len(unique_indices)
    if n_duplicates > 0:
        print(f"Removed {n_duplicates} duplicate points "
              f"({100*n_duplicates/len(X):.1f}%)")

    X_unique = X[unique_indices]

    # Handle labels if provided
    labels_unique = labels[unique_indices] if labels is not None else None

    if return_counts:
        return X_unique, unique_indices, labels_unique, counts
    return X_unique, unique_indices, labels_unique

# Usage
X_clean, indices, labels_clean = remove_duplicates(X_raw, labels_raw)
Y = tsne.fit_transform(X_clean)
```

Even with correct parameters and preprocessing, computational issues can derail t-SNE.
Pitfall 3.1: Not Setting Random Seed
Without a fixed random seed, results are not reproducible.
```python
from sklearn.manifold import TSNE

# WRONG: No random state - results change every run
tsne_unreproducible = TSNE(n_components=2, perplexity=30)
Y1 = tsne_unreproducible.fit_transform(X)  # Result A
Y2 = tsne_unreproducible.fit_transform(X)  # Result B (different!)

# CORRECT: Set random_state for reproducibility
tsne_reproducible = TSNE(n_components=2, perplexity=30, random_state=42)
Y1 = tsne_reproducible.fit_transform(X)  # Result A
Y2 = tsne_reproducible.fit_transform(X)  # Result A (same!)

# BEST PRACTICE: Run with multiple seeds to assess stability
seeds = [42, 123, 456, 789, 101112]
embeddings = []
for seed in seeds:
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed)
    Y = tsne.fit_transform(X)
    embeddings.append(Y)
# Compare embeddings to identify stable vs. unstable structures
```

Pitfall 3.2: Wrong Algorithm for Dataset Size
Using exact t-SNE on large datasets, or Barnes-Hut on very small datasets.
| Dataset Size | Recommended Method | Time Complexity | Notes |
|---|---|---|---|
| N < 500 | method='exact' or 'barnes_hut' | O(N²) | Either works; exact may be more accurate |
| 500 ≤ N < 50,000 | method='barnes_hut' | O(N log N) | Default in sklearn; good balance |
| N ≥ 50,000 | Use openTSNE or FIt-SNE | O(N) | sklearn may be slow; specialized implementations needed |
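The table above can be captured in a small dispatch helper. The function name and return strings are our own; the size boundaries (500 and 50,000) come directly from the table:

```python
def pick_tsne_method(n_samples):
    """Map dataset size to the method recommendations in the table.
    Boundaries follow the table; this helper itself is illustrative.
    """
    if n_samples < 500:
        return "exact"         # O(N^2) is affordable; slightly more accurate
    if n_samples < 50_000:
        return "barnes_hut"    # sklearn default, O(N log N)
    return "openTSNE/FIt-SNE"  # sklearn becomes slow at this scale

print(pick_tsne_method(200))
print(pick_tsne_method(10_000))
print(pick_tsne_method(100_000))
```

The first two return values can be passed straight to sklearn's `TSNE(method=...)`; the third is a pointer to specialized libraries rather than an sklearn option.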
Pitfall 3.3: Memory Errors with Large Datasets
t-SNE requires O(N²) memory for the probability matrix in naive implementations.
For N=50,000 points: The full probability matrix would require 50,000² × 8 bytes = 20 GB! Modern implementations use sparse matrices (only storing k nearest neighbors), but memory can still be an issue. For very large N, consider subsampling or using out-of-core implementations.
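The arithmetic above generalizes to a quick estimator. The sparse figure assumes ~3×perplexity stored neighbors per point, a common choice in Barnes-Hut-style implementations, and 8-byte floats throughout; both are stated assumptions, not guarantees about any particular library:

```python
def tsne_memory_gb(n_samples, perplexity=30, bytes_per_value=8):
    """Back-of-the-envelope memory estimates for the probability matrix.

    dense:  full N x N matrix (naive implementations)
    sparse: ~3*perplexity neighbors per point (Barnes-Hut-style storage)
    Both assume `bytes_per_value`-byte floats; index overhead is ignored.
    """
    dense = n_samples ** 2 * bytes_per_value
    sparse = n_samples * 3 * perplexity * bytes_per_value
    return dense / 1e9, sparse / 1e9

dense_gb, sparse_gb = tsne_memory_gb(50_000)
print(f"dense:  {dense_gb:.1f} GB")   # reproduces the 20 GB figure above
print(f"sparse: {sparse_gb:.3f} GB")  # a few tens of MB
```

Running this before a large job tells you quickly whether you need subsampling or a sparse, approximate implementation.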
```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

def tsne_with_subsample(X, labels=None, max_samples=10000,
                        perplexity=30, random_state=42):
    """
    Apply t-SNE with optional subsampling for very large datasets.
    For datasets larger than max_samples, randomly subsample.
    """
    N = X.shape[0]

    if N <= max_samples:
        # Dataset is small enough - use all data
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    random_state=random_state, n_iter=1000)
        Y = tsne.fit_transform(X)
        return Y, np.arange(N)

    # Subsample
    if labels is not None:
        # Stratified sampling to preserve class proportions:
        # split the index array once so data, labels, and indices stay aligned
        indices, _ = train_test_split(
            np.arange(N), train_size=max_samples,
            stratify=labels, random_state=random_state
        )
    else:
        # Random sampling
        rng = np.random.default_rng(random_state)
        indices = rng.choice(N, max_samples, replace=False)

    X_sample = X[indices]
    print(f"Subsampled from {N} to {max_samples} points")

    tsne = TSNE(n_components=2, perplexity=perplexity,
                random_state=random_state, n_iter=1000)
    Y = tsne.fit_transform(X_sample)
    return Y, indices

# Usage for large dataset
Y_sample, sample_indices = tsne_with_subsample(
    X_large, labels_large, max_samples=20000
)
```

Pitfall 3.4: Ignoring Convergence Warnings
sklearn may emit warnings about convergence—don't ignore them!
We covered interpretation guidelines in detail on the previous page, but let's consolidate the most dangerous interpretation errors here.
Pitfall 4.1: Concluding Similarity from Distance
"Cluster A is close to Cluster B and far from Cluster C in the t-SNE plot, so A is more similar to B than to C."
This conclusion is INVALID. t-SNE does not preserve inter-cluster distances. The relative positions of clusters in the embedding are not meaningful. To make similarity claims, you must analyze the original feature space.
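A minimal sketch of doing this right: compare cluster centroids in the original feature space, not in the embedding. The toy clusters below are synthetic and purely illustrative; for real data, substitute your own feature matrix and cluster assignments:

```python
import math
import random

random.seed(0)

def make_cluster(cx, cy, n=50, spread=0.1):
    """Synthetic 2-D cluster around (cx, cy) -- stand-in for real data."""
    return [(cx + random.gauss(0, spread), cy + random.gauss(0, spread))
            for _ in range(n)]

A = make_cluster(0, 0)
B = make_cluster(1, 0)
C = make_cluster(10, 0)

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def centroid_distance(p1, p2):
    """Distance between cluster means in the ORIGINAL feature space --
    the space where similarity claims must actually be checked."""
    return math.dist(centroid(p1), centroid(p2))

d_ab = centroid_distance(A, B)
d_ac = centroid_distance(A, C)
print(f"A-B: {d_ab:.2f}, A-C: {d_ac:.2f}")  # here A really is closer to B
```

A t-SNE plot of these three clusters could legally place C next to A; only the original-space distances above license the claim that A is more similar to B.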
Pitfall 4.2: Interpreting Cluster Size as Variance
"Cluster A appears larger than Cluster B, so A must have higher variance or be more diverse."
This conclusion is INVALID. Cluster sizes in t-SNE are artifacts of the algorithm, not reflections of data properties. A tight cluster in original space may appear large in t-SNE due to the heavy-tailed Student-t distribution.
Pitfall 4.3: Counting Clusters in t-SNE
"The t-SNE plot shows 7 distinct clusters, so our data has 7 natural groups."
This is RISKY. The number of visible clusters depends heavily on perplexity. Low perplexity artificially fragments continuous structures; high perplexity may merge distinct groups. Always validate cluster counts with multiple methods and perplexity values.
Pitfall 4.4: Single-Run Conclusions
Drawing conclusions from a single t-SNE run is unreliable: different random initializations can produce visibly different embeddings of the same data. Solution: Always run with multiple seeds and perplexity values. Only claim structure that is consistent across runs.
Pitfall 4.5: t-SNE for Non-Visualization Tasks
t-SNE is designed for VISUALIZATION, not as a general-purpose dimensionality reduction. Using t-SNE embeddings as features for downstream ML tasks (clustering, classification) is usually a bad idea:
- Embeddings are not stable across runs
- Distances are not meaningful for ML algorithms
- New points cannot be projected (no transform method)
- Better alternatives exist for feature extraction
Use PCA, UMAP, or autoencoders for feature extraction; use t-SNE only for visualization.
t-SNE is not the right tool for every dimensionality reduction task. Understanding its limitations helps you choose the right method.
t-SNE is NOT suitable when:
| Scenario | Why t-SNE Fails | Better Alternative |
|---|---|---|
| You need to project new points | No out-of-sample extension | UMAP, PCA, parametric t-SNE |
| You need meaningful distances | Distances are not preserved | MDS, UMAP, PCA |
| You need rotation-invariant features | Embeddings are rotation-arbitrary | PCA, autoencoders |
| Dataset has N > 100,000 | Slow even with approximations | UMAP, FIt-SNE, sampling |
| You need reproducible ML features | Results vary across runs | PCA, UMAP with fixed seed |
| Global structure matters | Only preserves local neighborhoods | MDS, UMAP, Isomap |
t-SNE IS suitable when:

- The goal is visualization and exploratory analysis, not feature extraction
- Local neighborhood structure is what you want to reveal
- The dataset is of moderate size (roughly N < 100,000)
- You treat the plot as hypothesis-generating, to be validated by other methods
UMAP (Uniform Manifold Approximation and Projection) addresses many t-SNE limitations: faster, supports out-of-sample projection, better preserves global structure, GPU-accelerated implementations available. For many use cases, UMAP is now preferred. Consider UMAP as your first choice for new projects, with t-SNE as a complementary method.
Before presenting any t-SNE result, run through this comprehensive checklist.
Pre-Computation Checklist:

- Features scaled (StandardScaler or MinMaxScaler)
- NaN/Inf values handled; exact duplicates removed
- Perplexity well below N/3 (typically 5-50)
- PCA reduction to ~50 dimensions if D > 50
- n_iter ≥ 1000; learning_rate='auto' (or scaled with N)
- random_state fixed, with a plan to re-run under multiple seeds
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

def robust_tsne(X, labels=None, perplexity=30, random_state=42):
    """
    Best-practice t-SNE implementation.
    Includes all recommended preprocessing and parameter settings.
    """
    # 1. Validate input
    assert not np.isnan(X).any(), "Data contains NaN values"
    assert not np.isinf(X).any(), "Data contains Inf values"

    N, D = X.shape
    print(f"Input: {N} samples, {D} features")

    # 2. Remove duplicates
    X_unique, unique_idx = np.unique(X, axis=0, return_index=True)
    if len(X_unique) < N:
        print(f"Removed {N - len(X_unique)} duplicate points")
        X = X_unique
        labels = labels[unique_idx] if labels is not None else None
        N = len(X)

    # 3. Validate perplexity
    assert perplexity < N / 3, f"Perplexity {perplexity} too high for N={N}"

    # 4. Scale features
    X_scaled = StandardScaler().fit_transform(X)

    # 5. Optional PCA for high-D data
    if D > 50:
        pca_dims = min(50, N - 1)
        X_scaled = PCA(n_components=pca_dims).fit_transform(X_scaled)
        print(f"Reduced from {D} to {pca_dims} dims with PCA")

    # 6. Run t-SNE with best practices
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        n_iter=1000,
        learning_rate='auto',
        init='pca',
        random_state=random_state,
        verbose=1
    )
    Y = tsne.fit_transform(X_scaled)

    print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")
    return Y, labels

# Usage
Y, labels = robust_tsne(X_raw, labels_raw, perplexity=30, random_state=42)
```

We've cataloged the most common mistakes in t-SNE usage across parameters, preprocessing, computation, and interpretation. Let's consolidate the key lessons.
Final Module Summary:
Over five pages, we've developed complete mastery of t-SNE:
With this knowledge, you can use t-SNE as a powerful, controlled visualization tool rather than a mysterious black box. Remember: t-SNE is for revealing local structure and generating hypotheses—not for proving claims about your data. Used correctly, it remains one of the most valuable tools in the machine learning practitioner's toolkit.
Congratulations! You have mastered t-SNE—from its elegant probabilistic foundations to its practical pitfalls. You can now use t-SNE with confidence, interpret its results correctly, and avoid the common mistakes that trip up even experienced practitioners. In the next module, we'll explore UMAP, a modern alternative that addresses many of t-SNE's limitations while preserving its visualization power.