You run your experiment, achieve 92% accuracy, and excitedly report the result. The next day, you run the same code again—91.3%. A colleague tries to reproduce your work—90.8%. Was the original result a fluke? Is there a bug? Or is this expected variability?
Reproducibility is the bedrock of scientific machine learning. Without it, we can't debug unexpected results, compare experiments fairly, or verify that a reported improvement is real rather than noise.
The random seed is the key that unlocks reproducibility. Understanding how randomness works in ML—and how to control it—transforms chaotic experimentation into rigorous science.
This page covers pseudorandom number generation, sources of randomness in ML pipelines, seed management across libraries and frameworks, reproducibility challenges with parallelism and GPUs, experiment tracking, and production-grade patterns. You'll master reproducibility at the level expected of senior ML engineers.
Computers are deterministic machines—given the same inputs, they produce the same outputs. So how do they generate 'random' numbers? The answer lies in pseudorandom number generators (PRNGs).
The PRNG Concept
A PRNG is a deterministic algorithm that produces a sequence of numbers that appear random but are entirely determined by an initial value called the seed.
Mathematical Definition:
A PRNG defines a recurrence relation: $$X_{n+1} = f(X_n)$$
where:
$X_n$ is the generator's internal state at step $n$
$f$ is a deterministic transition function
$X_0$ is the initial state, i.e., the seed
Given the same seed $X_0$, the sequence $X_0, X_1, X_2, \ldots$ is identical across runs.
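To make the recurrence concrete, here is a minimal sketch of a linear congruential generator (LCG), one of the simplest PRNGs. The constants are illustrative (the classic Numerical Recipes values), not a generator you would use in practice:

```python
# Minimal LCG sketch: X_{n+1} = (a*X_n + c) mod m, scaled to [0, 1).
def lcg(seed: int, n: int, a: int = 1664525, c: int = 1013904223, m: int = 2**32):
    x = seed
    values = []
    for _ in range(n):
        x = (a * x + c) % m   # the deterministic update f(X_n)
        values.append(x / m)  # map the state to a float in [0, 1)
    return values

print(lcg(42, 3))  # same seed -> same sequence, every run
print(lcg(42, 3))
```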
Common PRNGs in ML:
Mersenne Twister (MT19937): Used by Python's random module and NumPy's legacy global RNG (np.random.seed). Period of $2^{19937}-1$. Not cryptographically secure but excellent statistical properties.
PCG64 (Permuted Congruential Generator): Modern, faster than MT, better statistical properties. The default BitGenerator behind np.random.default_rng (NumPy 1.17+).
Xorshift/Xoshiro: Very fast, good quality. Used in some frameworks.
Hardware RNG: True randomness from physical processes. Typically used only for seeding, not bulk generation.
```python
import numpy as np

# ============================================
# PRNG Fundamentals: Same Seed = Same Sequence
# ============================================

# Without setting seed - different results each run
print("Without seed (will vary):")
print(np.random.random(5))

# With seed - identical results every run
print("\nWith seed 42 (always identical):")
np.random.seed(42)
result1 = np.random.random(5)
print(result1)

# Reset and try again - exact same sequence
np.random.seed(42)
result2 = np.random.random(5)
print(result2)

print(f"\nArrays identical: {np.array_equal(result1, result2)}")

# ============================================
# The Sequence is Deterministic
# ============================================
np.random.seed(42)
sequence = []
for i in range(10):
    sequence.append(np.random.random())

print("\nSequential calls produce deterministic sequence:")
print(sequence[:5])

# Same result using array generation
np.random.seed(42)
array_result = np.random.random(10)
print(array_result[:5])
print(f"Sequential == Array: {np.allclose(sequence, array_result)}")

# ============================================
# Modern NumPy: Generator Objects
# ============================================
# Recommended approach since NumPy 1.17
# More explicit, better reproducibility

rng = np.random.default_rng(seed=42)
print("\nUsing Generator object:")
print(rng.random(5))

# Can create multiple independent generators
rng1 = np.random.default_rng(seed=1001)
rng2 = np.random.default_rng(seed=1002)

# These produce different sequences
print(f"\nGenerator 1: {rng1.random(3)}")
print(f"Generator 2: {rng2.random(3)}")
```

For ML purposes, pseudorandomness is perfect—we don't need cryptographic unpredictability, just good statistical properties. In fact, true randomness would be problematic: we'd lose reproducibility. The only time true randomness matters is when seeding the PRNG itself for security-critical applications.
Machine learning pipelines have randomness scattered throughout. Understanding all sources is essential for achieving true reproducibility.
The Many Faces of Randomness:
| Stage | Source | Affected By | Controlled By |
|---|---|---|---|
| Data Splitting | Train/val/test partition | Which samples in which set | random_state in train_test_split |
| Shuffling | Data order for training | Batch composition, gradient noise | shuffle seeds in data loaders |
| Initialization | Neural network weights | Starting point for optimization | Framework-specific initialization seeds |
| Dropout | Which neurons are dropped | Regularization pattern | Training seed (per-forward-pass) |
| Data Augmentation | Random transforms (flip, crop, etc.) | Training sample variation | Augmentation pipeline seeds |
| Stochastic Algorithms | Random forests, SGD batches | Model structure, gradient estimates | Algorithm-specific random_state |
| Cross-validation | Fold assignments | Which samples in which fold | random_state in CV splitters |
| Hyperparameter Search | Random/Bayesian sampling | Which configurations explored | Search method seeds |
The Cascade Effect
Randomness early in the pipeline affects everything downstream: a different train/test split changes which samples the model sees, which changes every gradient update, which changes the final weights and the reported metric.
Even with 99% of the pipeline controlled, one uncontrolled source can cause variation.
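A minimal sketch of this effect, using synthetic data chosen for illustration: the split is seeded, but the model is not, so results still drift between runs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for _ in range(3):
    # Controlled: the split is identical in every iteration
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    # Uncontrolled: no random_state on the model, so each forest is built differently
    model = RandomForestClassifier(n_estimators=50).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(scores)  # typically not identical, despite the seeded split
```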
Library-Specific Random States
Different libraries maintain separate random states:
Python's random module: global state, used by some utilities and third-party libraries
NumPy: legacy global state (np.random.seed) plus independent Generator objects (np.random.default_rng)
PyTorch: its own CPU and CUDA generators (torch.manual_seed, torch.cuda.manual_seed_all)
TensorFlow: global and per-operation seeds (tf.random.set_seed)
Controlling only NumPy doesn't control PyTorch or TensorFlow!
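A quick sketch of this pitfall: seeding NumPy leaves the other generators untouched.

```python
import random
import numpy as np

np.random.seed(42)          # seeds NumPy's global state only
print(np.random.random())   # identical on every run
print(random.random())      # Python's `random` module is still unseeded: varies per run

try:
    import torch
    print(torch.rand(1))    # PyTorch's generator is also untouched by np.random.seed
except ImportError:
    pass
```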
Some randomness is hidden: hash randomization in Python 3.3+, threading-dependent operations, filesystem ordering, and non-deterministic algorithms in cuDNN. Achieving perfect reproducibility requires controlling all these sources—sometimes impossible without disabling optimizations.
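Hash randomization is easy to observe: run the snippet below twice as separate processes. Unless PYTHONHASHSEED is fixed, string hashes are salted per interpreter run (Python 3.3+), so the printed values can change between runs.

```python
# Without PYTHONHASHSEED set, str hashes are salted per process,
# so both lines below can differ from one run to the next.
print(hash("reproducibility"))
print(list({"adam", "sgd", "rmsprop"}))  # set order depends on the salted hashes
```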
Proper seed setting requires a comprehensive approach that covers all relevant libraries. Here's the production pattern.
```python
import os
import random
import numpy as np
from typing import Optional


def set_seed(seed: int, deterministic: bool = False) -> None:
    """
    Set random seeds for reproducibility across all libraries.

    Parameters:
    -----------
    seed : Random seed (integer)
    deterministic : If True, use deterministic algorithms (may be slower)
    """
    # Python's built-in random
    random.seed(seed)

    # NumPy
    np.random.seed(seed)

    # Environment variable for hash randomization
    os.environ['PYTHONHASHSEED'] = str(seed)

    # PyTorch (if available)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # For multi-GPU

        if deterministic:
            # Deterministic algorithms (slower but reproducible)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
            # For PyTorch 1.8+
            torch.use_deterministic_algorithms(True)
    except ImportError:
        pass

    # TensorFlow (if available)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)

        if deterministic:
            # TensorFlow 2.8+
            tf.config.experimental.enable_op_determinism()
    except ImportError:
        pass

    print(f"Seeds set to {seed}" + (" (deterministic mode)" if deterministic else ""))


def get_seeded_generator(seed: int) -> np.random.Generator:
    """
    Create an independent seeded generator for use in specific operations.
    Preferred over global seed for modular code.
    """
    return np.random.default_rng(seed)


# ============================================
# Usage Pattern: Script Entry Point
# ============================================
if __name__ == "__main__":
    # Set global seed at the very start
    GLOBAL_SEED = 42
    set_seed(GLOBAL_SEED)

    # For specific operations, create independent generators
    data_split_rng = get_seeded_generator(GLOBAL_SEED + 1)
    augmentation_rng = get_seeded_generator(GLOBAL_SEED + 2)

    # Now your experiment is reproducible
    # ...


# ============================================
# Scikit-learn Pattern: random_state Everywhere
# ============================================
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

SEED = 42

# Every scikit-learn operation that involves randomness
# should have random_state explicitly set

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=SEED,  # Explicit seed
    stratify=y
)

# Cross-validation
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=SEED  # Explicit seed
)

# Models with randomness
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=SEED  # Explicit seed
)

# Even deterministic-seeming models may have implementation randomness
lr = LogisticRegression(
    random_state=SEED,  # Solver randomness
    solver='saga'       # Stochastic solver
)
```

Reproducibility exists on a spectrum. Understanding the levels helps you make informed tradeoffs between reproducibility and performance.
Level 0: No Reproducibility. No seeds set. Results vary unpredictably. Debugging is impossible.
Level 1: Statistical Reproducibility. Multiple runs averaged. Individual runs vary, but aggregate statistics are stable. Useful for final reporting but hard to debug.
Level 2: Soft Reproducibility. Seeds set for main sources (data splits, model initialization). Minor numerical differences may occur due to floating-point non-associativity or threading.
Level 3: Hard Reproducibility. All seeds set, deterministic algorithms enabled. Results identical within the same environment. Small performance cost.
Level 4: Cross-Environment Reproducibility. Identical results across different machines, OS versions, and library versions. Requires containerization and version pinning. Significant overhead.
| Level | Effort | Performance Impact | Use Case |
|---|---|---|---|
| Level 0 | None | None | Prototyping only (not recommended) |
| Level 1 | Low | None | Research reporting (aggregate results) |
| Level 2 | Moderate | Minimal | Standard development and CI/CD |
| Level 3 | High | 5-20% slower | Debugging, regulated industries |
| Level 4 | Very High | Variable | Publication, legal requirements |
Most production ML teams operate at Level 2-3. During active development, Level 2 suffices; when debugging a specific issue, temporarily enable Level 3. Level 4 is reserved for audits, regulatory requirements, and final experiments before publication.
Why Perfect Reproducibility is Hard
Floating-Point Non-Associativity: $(a + b) + c \neq a + (b + c)$ in floating-point arithmetic. Parallel reduction order can change results.
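A two-line demonstration of non-associativity; when a parallel reduction sums the same numbers in a different order, the same effect shows up as tiny run-to-run differences:

```python
a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)  # 0.0  -- the 0.1 is absorbed by the huge intermediate value
print(a + (b + c))  # 0.1  -- b and c cancel first, so the 0.1 survives
```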
Hardware Differences: Different GPUs, CPU instruction sets, or even CPU generations can give different results.
Non-Deterministic Algorithms: Some optimized algorithms trade determinism for speed. cuDNN's autotuner selects different algorithms per run.
Library Version Differences: Algorithms change between versions. NumPy 1.16 and 1.20 may give different results for the same seed.
Operating System Differences: File system ordering, thread scheduling, and default precision vary across platforms.
Practical Implication: Accept that bit-for-bit reproducibility is often impractical. Instead, aim for 'statistical reproducibility'—small numerical differences that don't affect conclusions.
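In practice, statistical reproducibility means comparing results with an explicit tolerance rather than demanding bit-for-bit equality. A small sketch with illustrative metric values:

```python
import numpy as np

run_a = np.array([0.9123, 0.8876, 0.9001])   # metrics from run A (illustrative)
run_b = np.array([0.9124, 0.8875, 0.9003])   # metrics from run B

print(np.array_equal(run_a, run_b))           # False: not bit-identical
print(np.allclose(run_a, run_b, atol=1e-3))   # True: equivalent for practical purposes
```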
As experiments grow complex with many random components, seed management becomes a design challenge. Here are production-tested patterns.
```python
from dataclasses import dataclass
from typing import Dict, Optional
import hashlib
import json


@dataclass
class SeedConfig:
    """
    Centralized seed configuration for reproducible experiments.
    Derives component-specific seeds from a master seed.
    """
    master_seed: int

    @property
    def data_split_seed(self) -> int:
        """Seed for train/val/test splitting."""
        return self._derive_seed("data_split")

    @property
    def model_init_seed(self) -> int:
        """Seed for model weight initialization."""
        return self._derive_seed("model_init")

    @property
    def training_seed(self) -> int:
        """Seed for training (shuffling, dropout, etc.)."""
        return self._derive_seed("training")

    @property
    def augmentation_seed(self) -> int:
        """Seed for data augmentation."""
        return self._derive_seed("augmentation")

    @property
    def cv_seed(self) -> int:
        """Seed for cross-validation folds."""
        return self._derive_seed("cv")

    def _derive_seed(self, component: str) -> int:
        """
        Derive a component-specific seed from master seed.
        Uses hash function for good distribution.
        """
        combined = f"{self.master_seed}:{component}"
        hash_bytes = hashlib.sha256(combined.encode()).digest()
        # Use first 4 bytes as integer seed
        return int.from_bytes(hash_bytes[:4], 'big') % (2**31)

    def to_dict(self) -> Dict[str, int]:
        """Export all seeds for logging/tracking."""
        return {
            'master_seed': self.master_seed,
            'data_split_seed': self.data_split_seed,
            'model_init_seed': self.model_init_seed,
            'training_seed': self.training_seed,
            'augmentation_seed': self.augmentation_seed,
            'cv_seed': self.cv_seed
        }

    def __repr__(self):
        return f"SeedConfig(master_seed={self.master_seed})"


# ============================================
# Usage Pattern
# ============================================
from sklearn.model_selection import train_test_split, StratifiedKFold

# Create seed configuration from single master seed
seeds = SeedConfig(master_seed=42)
print(f"Seed configuration: {seeds.to_dict()}")

# Use derived seeds for each component
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=seeds.data_split_seed,  # Component-specific seed
    stratify=y
)

cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=seeds.cv_seed  # Different component, different seed
)


# ============================================
# Experiment Tracking with Seeds
# ============================================
class ExperimentTracker:
    """Track experiments with their complete seed configurations."""

    def __init__(self, experiment_name: str):
        self.experiment_name = experiment_name
        self.runs = []

    def log_run(
        self,
        seed_config: SeedConfig,
        hyperparams: Dict,
        metrics: Dict,
        notes: str = ""
    ):
        """Log a run with its seed configuration."""
        run_record = {
            'run_id': len(self.runs) + 1,
            'seeds': seed_config.to_dict(),
            'hyperparams': hyperparams,
            'metrics': metrics,
            'notes': notes
        }
        self.runs.append(run_record)
        return run_record

    def get_run(self, run_id: int) -> Optional[Dict]:
        """Retrieve a run by ID for reproduction."""
        for run in self.runs:
            if run['run_id'] == run_id:
                return run
        return None

    def reproduce_run(self, run_id: int) -> SeedConfig:
        """Get the seed config needed to reproduce a run."""
        run = self.get_run(run_id)
        if run:
            return SeedConfig(master_seed=run['seeds']['master_seed'])
        raise ValueError(f"Run {run_id} not found")

    def save(self, filepath: str):
        """Save experiment history to JSON."""
        with open(filepath, 'w') as f:
            json.dump({
                'experiment_name': self.experiment_name,
                'runs': self.runs
            }, f, indent=2)


# ============================================
# Multiple Independent Runs Pattern
# ============================================
def run_experiment_with_multiple_seeds(
    base_seed: int = 42,
    n_runs: int = 5
) -> list:
    """
    Run experiment multiple times with different seeds.
    Essential for understanding variance due to randomness.
    """
    all_results = []

    for run in range(n_runs):
        # Each run gets a different master seed
        run_seed = base_seed + run * 1000
        seeds = SeedConfig(master_seed=run_seed)

        # Set up reproducibility for this run
        set_seed(run_seed)

        # Run the experiment
        result = run_single_experiment(seeds)

        all_results.append({
            'run_id': run,
            'master_seed': run_seed,
            'metrics': result
        })

        print(f"Run {run+1}/{n_runs}: Accuracy = {result['accuracy']:.4f}")

    # Compute summary statistics
    accuracies = [r['metrics']['accuracy'] for r in all_results]
    print(f"\nSummary ({n_runs} runs):")
    print(f"  Mean: {np.mean(accuracies):.4f}")
    print(f"  Std:  {np.std(accuracies):.4f}")
    print(f"  Min:  {np.min(accuracies):.4f}")
    print(f"  Max:  {np.max(accuracies):.4f}")

    return all_results
```

Deriving component seeds from a master seed means you only need to record one number to reproduce an entire experiment. Changing the master seed changes everything consistently; the same master seed gives identical experiments. This is the gold standard for ML reproducibility.
Deep learning introduces additional reproducibility challenges beyond traditional ML. GPU operations, multi-threading, and non-deterministic optimizations create a perfect storm for irreproducibility.
```python
import os
import random
import numpy as np


# ============================================
# PyTorch Full Reproducibility Setup
# ============================================
def setup_pytorch_reproducibility(seed: int, strict: bool = True):
    """
    Configure PyTorch for reproducible results.

    Parameters:
    -----------
    seed : Random seed
    strict : If True, use fully deterministic algorithms (slower)
    """
    import torch

    # Basic seeds
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    # CUDA seeds
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

        if strict:
            # Deterministic cuDNN (slower but reproducible)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False

    # Python hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

    if strict:
        # PyTorch 1.8+ deterministic mode
        # Raises error for non-deterministic operations
        try:
            torch.use_deterministic_algorithms(True)
            # Environment variable for CUDA determinism
            os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
        except AttributeError:
            # Older PyTorch version
            pass

    return seed


# ============================================
# TensorFlow/Keras Full Reproducibility Setup
# ============================================
def setup_tensorflow_reproducibility(seed: int, strict: bool = True):
    """
    Configure TensorFlow for reproducible results.
    """
    import tensorflow as tf

    # Basic seeds
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

    # Python hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

    # TensorFlow determinism
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

    if strict:
        # TensorFlow 2.8+ deterministic mode
        try:
            tf.config.experimental.enable_op_determinism()
        except AttributeError:
            pass

    # Disable GPU memory growth randomness
    try:
        gpus = tf.config.experimental.list_physical_devices('GPU')
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, False)
    except Exception:
        pass

    return seed


# ============================================
# Checking Reproducibility
# ============================================
def verify_pytorch_reproducibility(seed: int = 42, n_trials: int = 3):
    """
    Verify that PyTorch training is reproducible.
    """
    import torch
    import torch.nn as nn

    results = []

    for trial in range(n_trials):
        # Reset everything
        setup_pytorch_reproducibility(seed, strict=True)

        # Simple model and forward pass
        model = nn.Linear(10, 2)
        x = torch.randn(5, 10)
        output = model(x)

        # Record result
        results.append(output.detach().numpy().copy())

    # Check all trials match
    all_match = all(
        np.allclose(results[0], results[i])
        for i in range(1, n_trials)
    )

    print(f"Reproducibility check: {'PASS' if all_match else 'FAIL'}")
    if not all_match:
        print("First output:", results[0])
        for i in range(1, n_trials):
            print(f"Trial {i+1} output:", results[i])

    return all_match


# ============================================
# DataLoader Reproducibility
# ============================================
def create_reproducible_dataloader(
    dataset,
    batch_size: int,
    seed: int,
    num_workers: int = 0
):
    """
    Create a DataLoader with reproducible shuffling.

    Note: num_workers > 0 can break reproducibility unless
    worker_init_fn is properly set.
    """
    import torch
    from torch.utils.data import DataLoader

    def seed_worker(worker_id):
        """Initialize each worker with a seed."""
        worker_seed = seed + worker_id
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    generator = torch.Generator()
    generator.manual_seed(seed)

    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
        pin_memory=False  # True can cause non-determinism
    )
```

Enabling strict determinism (cudnn.deterministic=True, use_deterministic_algorithms) can slow training by 10-30%. The cuDNN autotuner finds fast algorithms that may be non-deterministic. Trade off reproducibility against training time based on your needs.
Beyond setting seeds, proper experiment tracking ensures you can always reproduce past results. This is essential for debugging, publication, and production deployment.
```python
import json
import hashlib
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Dict, Any, Optional
import subprocess


@dataclass
class ExperimentConfig:
    """Complete configuration for reproducible experiments."""

    # Seeds
    master_seed: int

    # Data configuration
    data_path: str
    test_size: float
    stratify: bool

    # Model configuration
    model_type: str
    hyperparameters: Dict[str, Any]

    # Training configuration
    epochs: int
    batch_size: int
    learning_rate: float

    # Environment information (auto-captured)
    git_commit: Optional[str] = None
    python_version: Optional[str] = None
    package_versions: Optional[Dict[str, str]] = None
    timestamp: Optional[str] = None

    def __post_init__(self):
        """Capture environment information."""
        self.timestamp = datetime.now().isoformat()
        self.python_version = self._get_python_version()
        self.git_commit = self._get_git_commit()
        self.package_versions = self._get_package_versions()

    def _get_python_version(self) -> str:
        import sys
        return sys.version

    def _get_git_commit(self) -> Optional[str]:
        try:
            result = subprocess.run(
                ['git', 'rev-parse', 'HEAD'],
                capture_output=True, text=True, check=True
            )
            return result.stdout.strip()
        except Exception:
            return None

    def _get_package_versions(self) -> Dict[str, str]:
        """Capture versions of key packages."""
        versions = {}
        packages = ['numpy', 'pandas', 'sklearn', 'torch', 'tensorflow']
        for pkg in packages:
            try:
                module = __import__(pkg)
                versions[pkg] = getattr(module, '__version__', 'unknown')
            except ImportError:
                pass
        return versions

    def get_hash(self) -> str:
        """Get a hash of the configuration for identification."""
        config_str = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(config_str.encode()).hexdigest()[:16]

    def save(self, filepath: str):
        """Save configuration to JSON file."""
        with open(filepath, 'w') as f:
            json.dump(asdict(self), f, indent=2, default=str)

    @classmethod
    def load(cls, filepath: str) -> 'ExperimentConfig':
        """Load configuration from JSON file."""
        with open(filepath, 'r') as f:
            data = json.load(f)
        return cls(**data)


# ============================================
# Integration with MLflow (example)
# ============================================
def log_experiment_mlflow(config: ExperimentConfig, metrics: Dict[str, float]):
    """
    Log experiment to MLflow for tracking.
    """
    try:
        import mlflow

        with mlflow.start_run():
            # Log seeds
            mlflow.log_param("master_seed", config.master_seed)

            # Log configuration
            for key, value in config.hyperparameters.items():
                mlflow.log_param(f"hp_{key}", value)

            mlflow.log_param("model_type", config.model_type)
            mlflow.log_param("test_size", config.test_size)
            mlflow.log_param("epochs", config.epochs)
            mlflow.log_param("batch_size", config.batch_size)
            mlflow.log_param("learning_rate", config.learning_rate)

            # Log environment
            mlflow.log_param("git_commit", config.git_commit)

            # Log metrics
            for key, value in metrics.items():
                mlflow.log_metric(key, value)

            # Log full config as artifact
            config_path = f"/tmp/config_{config.get_hash()}.json"
            config.save(config_path)
            mlflow.log_artifact(config_path)

        print(f"Logged experiment: {config.get_hash()}")
    except ImportError:
        print("MLflow not installed, skipping logging")


# ============================================
# Usage Example
# ============================================

# Create experiment configuration
config = ExperimentConfig(
    master_seed=42,
    data_path="/data/my_dataset.csv",
    test_size=0.2,
    stratify=True,
    model_type="RandomForest",
    hyperparameters={
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5
    },
    epochs=1,  # Not used for RF but included for completeness
    batch_size=32,
    learning_rate=0.001
)

print(f"Experiment hash: {config.get_hash()}")
print(f"Git commit: {config.git_commit}")
print(f"Timestamp: {config.timestamp}")
```

Reproducibility through proper seed management is non-negotiable for professional ML. Let's consolidate the essential principles:
The Researcher's Checklist:
☐ Master seed defined as a constant at script top
☐ All libraries seeded at program start
☐ Every random_state parameter explicitly set
☐ Experiment config includes all seeds
☐ Git commit recorded with each run
☐ Package versions pinned and recorded
☐ Multiple seeds run for statistical reporting
☐ Reproducibility verified with test runs (see the sketch below)
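A minimal sketch of that verification step, assuming the `set_seed` helper defined earlier and a hypothetical `train_and_evaluate()` that runs one training cycle and returns a metrics dict:

```python
def test_training_is_reproducible():
    """Two runs with the same seed should produce (near-)identical metrics."""
    set_seed(42, deterministic=True)
    metrics_a = train_and_evaluate()   # hypothetical: trains and returns {'accuracy': ...}

    set_seed(42, deterministic=True)
    metrics_b = train_and_evaluate()

    assert abs(metrics_a["accuracy"] - metrics_b["accuracy"]) < 1e-9, (
        "Same seed produced different results: an uncontrolled randomness source remains"
    )
```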
Looking Ahead: Limitations
While seeds enable reproducibility, holdout validation still has fundamental limitations: variance from the single split, wasted data in the test set, and sensitivity to the particular random split. The next page examines these limitations in detail, motivating cross-validation approaches.
You now understand random seeds and reproducibility at production depth—from PRNG theory to framework-specific setup to experiment tracking. Next, we'll examine the fundamental limitations of holdout validation, setting the stage for cross-validation methods.