You run your experiment, achieve 92% accuracy, and excitedly report the result. The next day, you run the same code again—91.3%. A colleague tries to reproduce your work—90.8%. Was the original result a fluke? Is there a bug? Or is this expected variability?
Reproducibility is the bedrock of scientific machine learning. Without it, we can't debug unexpected results, compare experiments fairly, or verify that a reported improvement is real rather than noise.
The random seed is the key that unlocks reproducibility. Understanding how randomness works in ML—and how to control it—transforms chaotic experimentation into rigorous science.
This page covers pseudorandom number generation, sources of randomness in ML pipelines, seed management across libraries and frameworks, reproducibility challenges with parallelism and GPUs, experiment tracking, and production-grade patterns. You'll master reproducibility at the level expected of senior ML engineers.
Computers are deterministic machines—given the same inputs, they produce the same outputs. So how do they generate 'random' numbers? The answer lies in pseudorandom number generators (PRNGs).
The PRNG Concept
A PRNG is a deterministic algorithm that produces a sequence of numbers that appear random but are entirely determined by an initial value called the seed.
Mathematical Definition:
A PRNG defines a recurrence relation: $$X_{n+1} = f(X_n)$$
where:
$X_n$ is the generator's internal state at step $n$
$f$ is a deterministic transition function
$X_0$ is the initial state, i.e., the seed
Given the same seed $X_0$, the sequence $X_0, X_1, X_2, \ldots$ is identical across runs.
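To make the recurrence concrete, here is a minimal sketch of a linear congruential generator (LCG), one of the simplest PRNGs. The constants are illustrative (the classic Numerical Recipes values), not a generator you would use in practice:

```python
# Minimal LCG sketch: X_{n+1} = (a*X_n + c) mod m, scaled to [0, 1).
def lcg(seed: int, n: int, a: int = 1664525, c: int = 1013904223, m: int = 2**32):
    x = seed
    values = []
    for _ in range(n):
        x = (a * x + c) % m   # the deterministic update f(X_n)
        values.append(x / m)  # map the state to a float in [0, 1)
    return values

print(lcg(42, 3))  # same seed -> same sequence, every run
print(lcg(42, 3))
```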
Common PRNGs in ML:
Mersenne Twister (MT19937): Used by Python's random module and NumPy's legacy global RNG (np.random.seed). Period of $2^{19937}-1$. Not cryptographically secure but excellent statistical properties.
PCG64 (Permuted Congruential Generator): Modern, faster than MT, better statistical properties. The default BitGenerator behind np.random.default_rng (NumPy 1.17+).
Xorshift/Xoshiro: Very fast, good quality. Used in some frameworks.
Hardware RNG: True randomness from physical processes. Typically used only for seeding, not bulk generation.
```python
import numpy as np

# ============================================
# PRNG Fundamentals: Same Seed = Same Sequence
# ============================================

# Without setting seed - different results each run
print("Without seed (will vary):")
print(np.random.random(5))

# With seed - identical results every run
print("\nWith seed 42 (always identical):")
np.random.seed(42)
result1 = np.random.random(5)
print(result1)

# Reset and try again - exact same sequence
np.random.seed(42)
result2 = np.random.random(5)
print(result2)

print(f"\nArrays identical: {np.array_equal(result1, result2)}")

# ============================================
# The Sequence is Deterministic
# ============================================
np.random.seed(42)
sequence = []
for i in range(10):
    sequence.append(np.random.random())

print("\nSequential calls produce deterministic sequence:")
print(sequence[:5])

# Same result using array generation
np.random.seed(42)
array_result = np.random.random(10)
print(array_result[:5])
print(f"Sequential == Array: {np.allclose(sequence, array_result)}")

# ============================================
# Modern NumPy: Generator Objects
# ============================================
# Recommended approach since NumPy 1.17
# More explicit, better reproducibility

rng = np.random.default_rng(seed=42)
print("\nUsing Generator object:")
print(rng.random(5))

# Can create multiple independent generators
rng1 = np.random.default_rng(seed=1001)
rng2 = np.random.default_rng(seed=1002)

# These produce different sequences
print(f"\nGenerator 1: {rng1.random(3)}")
print(f"Generator 2: {rng2.random(3)}")
```

For ML purposes, pseudorandomness is perfect—we don't need cryptographic unpredictability, just good statistical properties. In fact, true randomness would be problematic: we'd lose reproducibility. The only time true randomness matters is when seeding the PRNG itself for security-critical applications.
Machine learning pipelines have randomness scattered throughout. Understanding all sources is essential for achieving true reproducibility.
The Many Faces of Randomness:
| Stage | Source | Affected By | Controlled By |
|---|---|---|---|
| Data Splitting | Train/val/test partition | Which samples in which set | random_state in train_test_split |
| Shuffling | Data order for training | Batch composition, gradient noise | shuffle seeds in data loaders |
| Initialization | Neural network weights | Starting point for optimization | Framework-specific initialization seeds |
| Dropout | Which neurons are dropped | Regularization pattern | Training seed (per-forward-pass) |
| Data Augmentation | Random transforms (flip, crop, etc.) | Training sample variation | Augmentation pipeline seeds |
| Stochastic Algorithms | Random forests, SGD batches | Model structure, gradient estimates | Algorithm-specific random_state |
| Cross-validation | Fold assignments | Which samples in which fold | random_state in CV splitters |
| Hyperparameter Search | Random/Bayesian sampling | Which configurations explored | Search method seeds |
The Cascade Effect
Randomness early in the pipeline affects everything downstream: a different train/test split changes which samples the model sees, which changes every gradient update, which changes the final weights and the reported metric.
Even with 99% of the pipeline controlled, one uncontrolled source can cause variation.
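A minimal sketch of this effect, using synthetic data chosen for illustration: the split is seeded, but the model is not, so results still drift between runs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for _ in range(3):
    # Controlled: the split is identical in every iteration
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    # Uncontrolled: no random_state on the model, so each forest is built differently
    model = RandomForestClassifier(n_estimators=50).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(scores)  # typically not identical, despite the seeded split
```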
Library-Specific Random States
Different libraries maintain separate random states:
Python's random module: global state, used by some utilities and third-party libraries
NumPy: legacy global state (np.random.seed) plus independent Generator objects (np.random.default_rng)
PyTorch: its own CPU and CUDA generators (torch.manual_seed, torch.cuda.manual_seed_all)
TensorFlow: global and per-operation seeds (tf.random.set_seed)
Controlling only NumPy doesn't control PyTorch or TensorFlow!
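A quick sketch of this pitfall: seeding NumPy leaves the other generators untouched.

```python
import random
import numpy as np

np.random.seed(42)          # seeds NumPy's global state only
print(np.random.random())   # identical on every run
print(random.random())      # Python's `random` module is still unseeded: varies per run

try:
    import torch
    print(torch.rand(1))    # PyTorch's generator is also untouched by np.random.seed
except ImportError:
    pass
```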
Some randomness is hidden: hash randomization in Python 3.3+, threading-dependent operations, filesystem ordering, and non-deterministic algorithms in cuDNN. Achieving perfect reproducibility requires controlling all these sources—sometimes impossible without disabling optimizations.
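Hash randomization is easy to observe: run the snippet below twice as separate processes. Unless PYTHONHASHSEED is fixed, string hashes are salted per interpreter run (Python 3.3+), so the printed values can change between runs.

```python
# Without PYTHONHASHSEED set, str hashes are salted per process,
# so both lines below can differ from one run to the next.
print(hash("reproducibility"))
print(list({"adam", "sgd", "rmsprop"}))  # set order depends on the salted hashes
```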
Proper seed setting requires a comprehensive approach that covers all relevant libraries. Here's the production pattern.
```python
import os
import random
import numpy as np
from typing import Optional


def set_seed(seed: int, deterministic: bool = False) -> None:
    """
    Set random seeds for reproducibility across all libraries.

    Parameters:
    -----------
    seed : Random seed (integer)
    deterministic : If True, use deterministic algorithms (may be slower)
    """
    # Python's built-in random
    random.seed(seed)

    # NumPy
    np.random.seed(seed)

    # Environment variable for hash randomization
    os.environ['PYTHONHASHSEED'] = str(seed)

    # PyTorch (if available)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # For multi-GPU

        if deterministic:
            # Deterministic algorithms (slower but reproducible)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
            # For PyTorch 1.8+
            torch.use_deterministic_algorithms(True)
    except ImportError:
        pass

    # TensorFlow (if available)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)

        if deterministic:
            # TensorFlow 2.8+
            tf.config.experimental.enable_op_determinism()
    except ImportError:
        pass

    print(f"Seeds set to {seed}" + (" (deterministic mode)" if deterministic else ""))


def get_seeded_generator(seed: int) -> np.random.Generator:
    """
    Create an independent seeded generator for use in specific operations.
    Preferred over global seed for modular code.
    """
    return np.random.default_rng(seed)


# ============================================
# Usage Pattern: Script Entry Point
# ============================================
if __name__ == "__main__":
    # Set global seed at the very start
    GLOBAL_SEED = 42
    set_seed(GLOBAL_SEED)

    # For specific operations, create independent generators
    data_split_rng = get_seeded_generator(GLOBAL_SEED + 1)
    augmentation_rng = get_seeded_generator(GLOBAL_SEED + 2)

    # Now your experiment is reproducible
    # ...


# ============================================
# Scikit-learn Pattern: random_state Everywhere
# ============================================
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

SEED = 42

# Every scikit-learn operation that involves randomness
# should have random_state explicitly set

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=SEED,  # Explicit seed
    stratify=y
)

# Cross-validation
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=SEED  # Explicit seed
)

# Models with randomness
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=SEED  # Explicit seed
)

# Even deterministic-seeming models may have implementation randomness
lr = LogisticRegression(
    random_state=SEED,  # Solver randomness
    solver='saga'       # Stochastic solver
)
```

Reproducibility exists on a spectrum. Understanding the levels helps you make informed tradeoffs between reproducibility and performance.
Level 0: No Reproducibility. No seeds set. Results vary unpredictably. Debugging is impossible.
Level 1: Statistical Reproducibility. Multiple runs averaged. Individual runs vary, but aggregate statistics are stable. Useful for final reporting but hard to debug.
Level 2: Soft Reproducibility. Seeds set for main sources (data splits, model initialization). Minor numerical differences may occur due to floating-point non-associativity or threading.
Level 3: Hard Reproducibility. All seeds set, deterministic algorithms enabled. Results identical within the same environment. Small performance cost.
Level 4: Cross-Environment Reproducibility. Identical results across different machines, OS versions, and library versions. Requires containerization and version pinning. Significant overhead.
| Level | Effort | Performance Impact | Use Case |
|---|---|---|---|
| Level 0 | None | None | Prototyping only (not recommended) |
| Level 1 | Low | None | Research reporting (aggregate results) |
| Level 2 | Moderate | Minimal | Standard development and CI/CD |
| Level 3 | High | 5-20% slower | Debugging, regulated industries |
| Level 4 | Very High | Variable | Publication, legal requirements |
Most production ML teams operate at Level 2-3. During active development, Level 2 suffices; when debugging a specific issue, temporarily enable Level 3. Level 4 is reserved for audits, regulatory requirements, and final experiments before publication.
Why Perfect Reproducibility is Hard
Floating-Point Non-Associativity: $(a + b) + c \neq a + (b + c)$ in floating-point arithmetic. Parallel reduction order can change results.
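A two-line demonstration of non-associativity; when a parallel reduction sums the same numbers in a different order, the same effect shows up as tiny run-to-run differences:

```python
a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)  # 0.0  -- the 0.1 is absorbed by the huge intermediate value
print(a + (b + c))  # 0.1  -- b and c cancel first, so the 0.1 survives
```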
Hardware Differences: Different GPUs, CPU instruction sets, or even CPU generations can give different results.
Non-Deterministic Algorithms: Some optimized algorithms trade determinism for speed. cuDNN's autotuner selects different algorithms per run.
Library Version Differences: Algorithms change between versions. NumPy 1.16 and 1.20 may give different results for the same seed.
Operating System Differences: File system ordering, thread scheduling, and default precision vary across platforms.
Practical Implication: Accept that bit-for-bit reproducibility is often impractical. Instead, aim for 'statistical reproducibility'—small numerical differences that don't affect conclusions.
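In practice, statistical reproducibility means comparing results with an explicit tolerance rather than demanding bit-for-bit equality. A small sketch with illustrative metric values:

```python
import numpy as np

run_a = np.array([0.9123, 0.8876, 0.9001])   # metrics from run A (illustrative)
run_b = np.array([0.9124, 0.8875, 0.9003])   # metrics from run B

print(np.array_equal(run_a, run_b))           # False: not bit-identical
print(np.allclose(run_a, run_b, atol=1e-3))   # True: equivalent for practical purposes
```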
As experiments grow complex with many random components, seed management becomes a design challenge. Here are production-tested patterns.
```python
from dataclasses import dataclass
from typing import Dict, Optional
import hashlib
import json


@dataclass
class SeedConfig:
    """
    Centralized seed configuration for reproducible experiments.
    Derives component-specific seeds from a master seed.
    """
    master_seed: int

    @property
    def data_split_seed(self) -> int:
        """Seed for train/val/test splitting."""
        return self._derive_seed("data_split")

    @property
    def model_init_seed(self) -> int:
        """Seed for model weight initialization."""
        return self._derive_seed("model_init")

    @property
    def training_seed(self) -> int:
        """Seed for training (shuffling, dropout, etc.)."""
        return self._derive_seed("training")

    @property
    def augmentation_seed(self) -> int:
        """Seed for data augmentation."""
        return self._derive_seed("augmentation")

    @property
    def cv_seed(self) -> int:
        """Seed for cross-validation folds."""
        return self._derive_seed("cv")

    def _derive_seed(self, component: str) -> int:
        """
        Derive a component-specific seed from master seed.
        Uses hash function for good distribution.
        """
        combined = f"{self.master_seed}:{component}"
        hash_bytes = hashlib.sha256(combined.encode()).digest()
        # Use first 4 bytes as integer seed
        return int.from_bytes(hash_bytes[:4], 'big') % (2**31)

    def to_dict(self) -> Dict[str, int]:
        """Export all seeds for logging/tracking."""
        return {
            'master_seed': self.master_seed,
            'data_split_seed': self.data_split_seed,
            'model_init_seed': self.model_init_seed,
            'training_seed': self.training_seed,
            'augmentation_seed': self.augmentation_seed,
            'cv_seed': self.cv_seed
        }

    def __repr__(self):
        return f"SeedConfig(master_seed={self.master_seed})"


# ============================================
# Usage Pattern
# ============================================
from sklearn.model_selection import train_test_split, StratifiedKFold

# Create seed configuration from single master seed
seeds = SeedConfig(master_seed=42)
print(f"Seed configuration: {seeds.to_dict()}")

# Use derived seeds for each component
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=seeds.data_split_seed,  # Component-specific seed
    stratify=y
)

cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=seeds.cv_seed  # Different component, different seed
)


# ============================================
# Experiment Tracking with Seeds
# ============================================
class ExperimentTracker:
    """Track experiments with their complete seed configurations."""

    def __init__(self, experiment_name: str):
        self.experiment_name = experiment_name
        self.runs = []

    def log_run(
        self,
        seed_config: SeedConfig,
        hyperparams: Dict,
        metrics: Dict,
        notes: str = ""
    ):
        """Log a run with its seed configuration."""
        run_record = {
            'run_id': len(self.runs) + 1,
            'seeds': seed_config.to_dict(),
            'hyperparams': hyperparams,
            'metrics': metrics,
            'notes': notes
        }
        self.runs.append(run_record)
        return run_record

    def get_run(self, run_id: int) -> Optional[Dict]:
        """Retrieve a run by ID for reproduction."""
        for run in self.runs:
            if run['run_id'] == run_id:
                return run
        return None

    def reproduce_run(self, run_id: int) -> SeedConfig:
        """Get the seed config needed to reproduce a run."""
        run = self.get_run(run_id)
        if run:
            return SeedConfig(master_seed=run['seeds']['master_seed'])
        raise ValueError(f"Run {run_id} not found")

    def save(self, filepath: str):
        """Save experiment history to JSON."""
        with open(filepath, 'w') as f:
            json.dump({
                'experiment_name': self.experiment_name,
                'runs': self.runs
            }, f, indent=2)


# ============================================
# Multiple Independent Runs Pattern
# ============================================
def run_experiment_with_multiple_seeds(
    base_seed: int = 42,
    n_runs: int = 5
) -> list:
    """
    Run experiment multiple times with different seeds.
    Essential for understanding variance due to randomness.
    """
    all_results = []

    for run in range(n_runs):
        # Each run gets a different master seed
        run_seed = base_seed + run * 1000
        seeds = SeedConfig(master_seed=run_seed)

        # Set up reproducibility for this run
        set_seed(run_seed)

        # Run the experiment
        result = run_single_experiment(seeds)

        all_results.append({
            'run_id': run,
            'master_seed': run_seed,
            'metrics': result
        })

        print(f"Run {run+1}/{n_runs}: Accuracy = {result['accuracy']:.4f}")

    # Compute summary statistics
    accuracies = [r['metrics']['accuracy'] for r in all_results]
    print(f"\nSummary ({n_runs} runs):")
    print(f"  Mean: {np.mean(accuracies):.4f}")
    print(f"  Std:  {np.std(accuracies):.4f}")
    print(f"  Min:  {np.min(accuracies):.4f}")
    print(f"  Max:  {np.max(accuracies):.4f}")

    return all_results
```

Deriving component seeds from a master seed means you only need to record one number to reproduce an entire experiment. Changing the master seed changes everything consistently; the same master seed gives identical experiments. This is the gold standard for ML reproducibility.
Deep learning introduces additional reproducibility challenges beyond traditional ML. GPU operations, multi-threading, and non-deterministic optimizations create a perfect storm for irreproducibility.
```python
import os
import random
import numpy as np


# ============================================
# PyTorch Full Reproducibility Setup
# ============================================
def setup_pytorch_reproducibility(seed: int, strict: bool = True):
    """
    Configure PyTorch for reproducible results.

    Parameters:
    -----------
    seed : Random seed
    strict : If True, use fully deterministic algorithms (slower)
    """
    import torch

    # Basic seeds
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    # CUDA seeds
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

        if strict:
            # Deterministic cuDNN (slower but reproducible)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False

    # Python hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

    if strict:
        # PyTorch 1.8+ deterministic mode
        # Raises error for non-deterministic operations
        try:
            torch.use_deterministic_algorithms(True)
            # Environment variable for CUDA determinism
            os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
        except AttributeError:
            # Older PyTorch version
            pass

    return seed


# ============================================
# TensorFlow/Keras Full Reproducibility Setup
# ============================================
def setup_tensorflow_reproducibility(seed: int, strict: bool = True):
    """
    Configure TensorFlow for reproducible results.
    """
    import tensorflow as tf

    # Basic seeds
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

    # Python hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

    # TensorFlow determinism
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

    if strict:
        # TensorFlow 2.8+ deterministic mode
        try:
            tf.config.experimental.enable_op_determinism()
        except AttributeError:
            pass

    # Disable GPU memory growth randomness
    try:
        gpus = tf.config.experimental.list_physical_devices('GPU')
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, False)
    except Exception:
        pass

    return seed


# ============================================
# Checking Reproducibility
# ============================================
def verify_pytorch_reproducibility(seed: int = 42, n_trials: int = 3):
    """
    Verify that PyTorch training is reproducible.
    """
    import torch
    import torch.nn as nn

    results = []

    for trial in range(n_trials):
        # Reset everything
        setup_pytorch_reproducibility(seed, strict=True)

        # Simple model and forward pass
        model = nn.Linear(10, 2)
        x = torch.randn(5, 10)
        output = model(x)

        # Record result
        results.append(output.detach().numpy().copy())

    # Check all trials match
    all_match = all(
        np.allclose(results[0], results[i])
        for i in range(1, n_trials)
    )

    print(f"Reproducibility check: {'PASS' if all_match else 'FAIL'}")
    if not all_match:
        print("First output:", results[0])
        for i in range(1, n_trials):
            print(f"Trial {i+1} output:", results[i])

    return all_match


# ============================================
# DataLoader Reproducibility
# ============================================
def create_reproducible_dataloader(
    dataset,
    batch_size: int,
    seed: int,
    num_workers: int = 0
):
    """
    Create a DataLoader with reproducible shuffling.

    Note: num_workers > 0 can break reproducibility unless
    worker_init_fn is properly set.
    """
    import torch
    from torch.utils.data import DataLoader

    def seed_worker(worker_id):
        """Initialize each worker with a seed."""
        worker_seed = seed + worker_id
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    generator = torch.Generator()
    generator.manual_seed(seed)

    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
        pin_memory=False  # True can cause non-determinism
    )
```

Enabling strict determinism (cudnn.deterministic=True, use_deterministic_algorithms) can slow training by 10-30%. The cuDNN autotuner finds fast algorithms that may be non-deterministic. Trade off reproducibility against training time based on your needs.
Beyond setting seeds, proper experiment tracking ensures you can always reproduce past results. This is essential for debugging, publication, and production deployment.
```python
import json
import hashlib
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Dict, Any, Optional
import subprocess


@dataclass
class ExperimentConfig:
    """Complete configuration for reproducible experiments."""

    # Seeds
    master_seed: int

    # Data configuration
    data_path: str
    test_size: float
    stratify: bool

    # Model configuration
    model_type: str
    hyperparameters: Dict[str, Any]

    # Training configuration
    epochs: int
    batch_size: int
    learning_rate: float

    # Environment information (auto-captured)
    git_commit: Optional[str] = None
    python_version: Optional[str] = None
    package_versions: Optional[Dict[str, str]] = None
    timestamp: Optional[str] = None

    def __post_init__(self):
        """Capture environment information."""
        self.timestamp = datetime.now().isoformat()
        self.python_version = self._get_python_version()
        self.git_commit = self._get_git_commit()
        self.package_versions = self._get_package_versions()

    def _get_python_version(self) -> str:
        import sys
        return sys.version

    def _get_git_commit(self) -> Optional[str]:
        try:
            result = subprocess.run(
                ['git', 'rev-parse', 'HEAD'],
                capture_output=True, text=True, check=True
            )
            return result.stdout.strip()
        except Exception:
            return None

    def _get_package_versions(self) -> Dict[str, str]:
        """Capture versions of key packages."""
        versions = {}
        packages = ['numpy', 'pandas', 'sklearn', 'torch', 'tensorflow']
        for pkg in packages:
            try:
                module = __import__(pkg)
                versions[pkg] = getattr(module, '__version__', 'unknown')
            except ImportError:
                pass
        return versions

    def get_hash(self) -> str:
        """Get a hash of the configuration for identification."""
        config_str = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(config_str.encode()).hexdigest()[:16]

    def save(self, filepath: str):
        """Save configuration to JSON file."""
        with open(filepath, 'w') as f:
            json.dump(asdict(self), f, indent=2, default=str)

    @classmethod
    def load(cls, filepath: str) -> 'ExperimentConfig':
        """Load configuration from JSON file."""
        with open(filepath, 'r') as f:
            data = json.load(f)
        return cls(**data)


# ============================================
# Integration with MLflow (example)
# ============================================
def log_experiment_mlflow(config: ExperimentConfig, metrics: Dict[str, float]):
    """
    Log experiment to MLflow for tracking.
    """
    try:
        import mlflow

        with mlflow.start_run():
            # Log seeds
            mlflow.log_param("master_seed", config.master_seed)

            # Log configuration
            for key, value in config.hyperparameters.items():
                mlflow.log_param(f"hp_{key}", value)

            mlflow.log_param("model_type", config.model_type)
            mlflow.log_param("test_size", config.test_size)
            mlflow.log_param("epochs", config.epochs)
            mlflow.log_param("batch_size", config.batch_size)
            mlflow.log_param("learning_rate", config.learning_rate)

            # Log environment
            mlflow.log_param("git_commit", config.git_commit)

            # Log metrics
            for key, value in metrics.items():
                mlflow.log_metric(key, value)

            # Log full config as artifact
            config_path = f"/tmp/config_{config.get_hash()}.json"
            config.save(config_path)
            mlflow.log_artifact(config_path)

        print(f"Logged experiment: {config.get_hash()}")
    except ImportError:
        print("MLflow not installed, skipping logging")


# ============================================
# Usage Example
# ============================================

# Create experiment configuration
config = ExperimentConfig(
    master_seed=42,
    data_path="/data/my_dataset.csv",
    test_size=0.2,
    stratify=True,
    model_type="RandomForest",
    hyperparameters={
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5
    },
    epochs=1,  # Not used for RF but included for completeness
    batch_size=32,
    learning_rate=0.001
)

print(f"Experiment hash: {config.get_hash()}")
print(f"Git commit: {config.git_commit}")
print(f"Timestamp: {config.timestamp}")
```

Reproducibility through proper seed management is non-negotiable for professional ML. Let's consolidate the essential principles:
The Researcher's Checklist:
☐ Master seed defined as a constant at script top
☐ All libraries seeded at program start
☐ Every random_state parameter explicitly set
☐ Experiment config includes all seeds
☐ Git commit recorded with each run
☐ Package versions pinned and recorded
☐ Multiple seeds run for statistical reporting
☐ Reproducibility verified with test runs (see the sketch below)
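A minimal sketch of that verification step, assuming the `set_seed` helper defined earlier and a hypothetical `train_and_evaluate()` that runs one training cycle and returns a metrics dict:

```python
def test_training_is_reproducible():
    """Two runs with the same seed should produce (near-)identical metrics."""
    set_seed(42, deterministic=True)
    metrics_a = train_and_evaluate()   # hypothetical: trains and returns {'accuracy': ...}

    set_seed(42, deterministic=True)
    metrics_b = train_and_evaluate()

    assert abs(metrics_a["accuracy"] - metrics_b["accuracy"]) < 1e-9, (
        "Same seed produced different results: an uncontrolled randomness source remains"
    )
```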
Looking Ahead: Limitations
While seeds enable reproducibility, holdout validation still has fundamental limitations: variance from the single split, wasted data in the test set, and sensitivity to the particular random split. The next page examines these limitations in detail, motivating cross-validation approaches.
You now understand random seeds and reproducibility at production depth—from PRNG theory to framework-specific setup to experiment tracking. Next, we'll examine the fundamental limitations of holdout validation, setting the stage for cross-validation methods.