Every machine learning project ultimately faces a fundamental question: Will this model work on new data?
A model that perfectly memorizes training examples but fails on fresh data is worthless. A model that makes predictions we can trust on unseen examples is valuable. The difference between these outcomes is generalization—the ability to extend what's learned from observed data to unobserved data.
But here's the challenge: by definition, we cannot directly measure performance on truly unseen data because... we haven't seen it yet. This creates an epistemological puzzle: How can we estimate future performance using only past data?
The answer to this puzzle—one of the most important methodological insights in machine learning—is the disciplined separation of data into training, validation, and test sets. This seemingly simple idea has profound implications for how we build, evaluate, and deploy ML models.
Improper data splits lead to overly optimistic performance estimates, models that fail in production, and wasted resources chasing phantom improvements. Perhaps worse, they erode trust in machine learning itself. Understanding data splits isn't optional—it's foundational to credible ML practice.
To understand why data splits matter, we must first understand what happens without them.
The Overfitting Scenario:
Imagine training a model on 1,000 labeled examples and then evaluating its accuracy on... those same 1,000 examples. What could go wrong?
Everything.
A sufficiently complex model can achieve perfect accuracy on training data by simply memorizing every example. Consider a decision tree with no depth limit: it can create a leaf node for every training point, perfectly classifying each one. A k-nearest neighbors model with k=1 will classify every training point correctly (its nearest neighbor is itself). A high-degree polynomial can pass through every data point exactly.
This perfect training performance tells us nothing about how the model will behave on new data. In fact, these memorized models typically perform terribly on new data because they've learned the noise and idiosyncrasies of the training set rather than the underlying patterns.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic data: y = sin(2x) + noise
np.random.seed(42)
n_samples = 20
X = np.linspace(0, 3, n_samples).reshape(-1, 1)
y_true = np.sin(2 * X).ravel()
y = y_true + 0.1 * np.random.randn(n_samples)  # Add noise

# Fit models of increasing complexity
degrees = [1, 3, 15]
X_plot = np.linspace(0, 3, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, degree in zip(axes, degrees):
    # Create polynomial features
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    X_plot_poly = poly.transform(X_plot)

    # Fit model
    model = LinearRegression()
    model.fit(X_poly, y)

    # Evaluate on training data (THIS IS THE MISTAKE)
    y_pred_train = model.predict(X_poly)
    train_mse = mean_squared_error(y, y_pred_train)

    # Plot
    ax.scatter(X, y, color='blue', s=50, label='Training data')
    ax.plot(X_plot, model.predict(X_plot_poly), 'r-', linewidth=2, label='Model')
    ax.set_title(f'Degree {degree}\nTrain MSE: {train_mse:.4f}')
    ax.legend()

plt.tight_layout()
plt.suptitle('Lower Training Error ≠ Better Model!', y=1.02, fontsize=14)
plt.show()

# The degree-15 polynomial has near-zero training error
# But it will perform terribly on new data!
print("Lesson: Training error is a BIASED estimate of true performance")
print("The more complex the model, the more optimistic (and misleading) training error becomes")
```

Training error is a biased and overly optimistic estimate of true performance. To get an honest estimate, we MUST evaluate on data the model has never seen during training. This is not optional—it's mathematically necessary.
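The memorizers mentioned above behave the same way: a 1-nearest-neighbor classifier or an unlimited-depth decision tree reaches perfect training accuracy while doing much worse on data it has not seen. Here is a minimal sketch, using a synthetic dataset and a simple held-out split (which the next part formalizes); the dataset sizes and noise level are chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic, somewhat noisy classification problem (sizes are arbitrary)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

memorizers = {
    '1-NN (nearest neighbor is itself)': KNeighborsClassifier(n_neighbors=1),
    'Unlimited-depth decision tree': DecisionTreeClassifier(max_depth=None, random_state=0),
}
for name, model in memorizers.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
# Both models score 1.00 on the training data they memorized,
# but noticeably lower on the held-out half of the data.
```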
The simplest solution to the memorization problem is the holdout method: split your data into two disjoint sets.
Training Set: the data the learning algorithm uses to fit the model's parameters.
Test Set: data held out entirely from training and used only to estimate performance on unseen examples.
Formally, given dataset $\mathcal{D}$ with $n$ examples:
$$\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{test}, \quad \mathcal{D}_{train} \cap \mathcal{D}_{test} = \emptyset$$
Common split ratios: 80/20 and 70/30 (train/test) are typical defaults; the larger the dataset, the smaller the fraction you need to hold out.
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load data
X, y = load_iris(return_X_y=True)
print(f"Total samples: {len(X)}")

# Split data: 80% train, 20% test
# random_state ensures reproducibility
# stratify=y ensures class balance is preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # Reproducibility
    stratify=y          # Maintain class proportions
)

print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

# Verify stratification
print(f"\nClass distribution in full data: {np.bincount(y)}")
print(f"Class distribution in training: {np.bincount(y_train)}")
print(f"Class distribution in test: {np.bincount(y_test)}")

# Train model ONLY on training data
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate on BOTH sets to see the difference
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"\nTraining accuracy: {train_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Gap: {train_acc - test_acc:.4f} (smaller is better)")

# If gap is large → model is overfitting
# If both are low → model is underfitting
```

The test set must remain untouched until final evaluation. If you use test set performance to make ANY decisions—algorithm choice, hyperparameter tuning, feature selection—you corrupt its validity. The test set becomes part of your 'training' process, and its error estimate becomes biased.
The train-test split has a critical limitation: we can't use the test set to make decisions.
But in practice, we need to make many decisions: which algorithm to use, which hyperparameters to set, which features to include, and when to stop training.
If we use the test set to answer these questions, we're implicitly fitting to the test set, and our performance estimate becomes optimistically biased.
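A quick simulation shows how strong this bias can be. In this hedged sketch (all sizes are made up for illustration), the labels are pure coin flips, so no model can genuinely exceed 50% accuracy; yet selecting the best of many zero-skill candidates by their score on a single fixed held-out set produces a score that looks comfortably better than chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heldout = 200        # size of the held-out set (assumed for illustration)
n_candidates = 50      # number of models/decisions we "try" against it

y_heldout = rng.integers(0, 2, size=n_heldout)  # random ground truth: pure noise
best_score = 0.0
for _ in range(n_candidates):
    # A candidate with zero real skill: random predictions
    random_predictions = rng.integers(0, 2, size=n_heldout)
    score = (random_predictions == y_heldout).mean()
    best_score = max(best_score, score)

print("True achievable accuracy: 0.50")
print(f"Best 'held-out' accuracy after {n_candidates} peeks: {best_score:.3f}")
# The winning score beats chance purely through selection.
# The same effect inflates your test score whenever test results guide decisions.
```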
The Solution: A Three-Way Split
We introduce a third subset—the validation set (also called development set or dev set):
$$\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{val} \cup \mathcal{D}_{test}$$
All three sets must be mutually exclusive: $$\mathcal{D}_{train} \cap \mathcal{D}_{val} = \emptyset, \quad \mathcal{D}_{train} \cap \mathcal{D}_{test} = \emptyset, \quad \mathcal{D}_{val} \cap \mathcal{D}_{test} = \emptyset$$
| Set | Typical Size | Purpose | How Often Used | Who 'Sees' It |
|---|---|---|---|---|
| Training | 60-80% | Learn model parameters | Many times (each epoch) | The learning algorithm |
| Validation | 10-20% | Tune hyperparameters, select models, early stopping | Many times (after each experiment) | The engineer (indirectly) |
| Test | 10-20% | Final, unbiased performance estimate | Once (at the very end) | No one until final evaluation |
The Workflow:
1. Train candidate models on the training set only.
2. Evaluate each candidate on the validation set; use these scores for model selection, hyperparameter tuning, and early stopping.
3. Pick the best candidate and, optionally, retrain it on the combined training + validation data.
4. Evaluate the final model exactly once on the test set.
This workflow ensures that your final performance estimate comes from data that was never involved in any decision-making process.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# First split: separate test set (final evaluation only)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: separate validation set from training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# 0.25 of 0.8 = 0.2, so we get a 60/20/20 split

print(f"Training set: {len(X_train)} samples (60%)")
print(f"Validation set: {len(X_val)} samples (20%)")
print(f"Test set: {len(X_test)} samples (20%)")

# Step 1: Train multiple models on TRAINING data only
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42)
}

# Step 2: Evaluate on VALIDATION set to choose best model
print("\n--- Model Selection (using validation set) ---")
val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_pred = model.predict(X_val)
    val_scores[name] = accuracy_score(y_val, val_pred)
    print(f"{name}: Validation Accuracy = {val_scores[name]:.4f}")

# Step 3: Select best model based on validation performance
best_model_name = max(val_scores, key=val_scores.get)
print(f"\nBest model: {best_model_name}")

# Step 4: Retrain on train + validation combined
X_train_full = np.vstack([X_train, X_val])
y_train_full = np.concatenate([y_train, y_val])

final_model = models[best_model_name].__class__(**models[best_model_name].get_params())
final_model.fit(X_train_full, y_train_full)

# Step 5: Final evaluation on TEST set (only once!)
test_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)

print(f"\n--- Final Evaluation (TEST SET - use only once!) ---")
print(f"Test Accuracy: {test_accuracy:.4f}")
print("This is our best estimate of real-world performance.")
```

Once you've selected your best model using the validation set, that validation data has served its purpose. You can now merge it with training data to train the final model on 80% of your data instead of 60%, potentially improving performance. The test set remains pristine for final evaluation.
The holdout approach has a significant limitation: the validation estimate is based on a single random split. If you're unlucky, an 'easy' subset ends up in validation, giving optimistic estimates—or a 'hard' subset gives pessimistic estimates.
Cross-Validation (CV) addresses this by systematically using all data for both training and validation.
K-Fold Cross-Validation: split the data into k folds of roughly equal size. For each of k rounds, train on k−1 folds and validate on the remaining fold, rotating which fold is held out; the k validation scores are then averaged.
This gives a more robust estimate because every data point is used for validation exactly once, and the estimate is less dependent on any single random split.
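To make the rotation explicit, here is a minimal sketch of the loop that `cross_val_score` performs for you, reusing the same iris data and logistic regression that appear in the cross-validation example below:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    # Each round: fit on k-1 folds, validate on the held-out fold
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[val_idx], model.predict(X[val_idx]))
    fold_scores.append(score)
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} val, accuracy = {score:.4f}")

print(f"Mean CV accuracy: {np.mean(fold_scores):.4f}")
```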
| Method | Description | When to Use | Pros | Cons |
|---|---|---|---|---|
| K-Fold CV | Split into k folds, rotate validation | General purpose, k=5 or 10 | Robust, efficient | Moderate computation |
| Stratified K-Fold | K-Fold preserving class proportions | Classification with imbalance | Better for classification | Slightly more complex |
| Leave-One-Out (LOO) | K-Fold where k=n | Very small datasets | Maximum training data | High variance, slow |
| Repeated K-Fold | K-Fold repeated multiple times | When estimate variance matters | Lower variance estimate | More computation |
| Time Series CV | Expanding/sliding window | Sequential/temporal data | Respects time ordering | Less training data early on |
| Group K-Fold | Keep groups together | When samples aren't independent | Prevents data leakage | Requires group labels |
```python
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold, LeaveOneOut, TimeSeriesSplit
)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Standard K-Fold (k=5)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = cross_val_score(model, X, y, cv=kfold)
print(f"5-Fold CV: {scores_kfold.mean():.4f} ± {scores_kfold.std():.4f}")
print(f"  Individual fold scores: {scores_kfold.round(4)}")

# Stratified K-Fold (maintains class proportions in each fold)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(model, X, y, cv=skfold)
print(f"\nStratified 5-Fold CV: {scores_strat.mean():.4f} ± {scores_strat.std():.4f}")

# 10-Fold (more folds = more training data per fold, but more variance)
scores_10fold = cross_val_score(model, X, y, cv=10)
print(f"\n10-Fold CV: {scores_10fold.mean():.4f} ± {scores_10fold.std():.4f}")

# Leave-One-Out (extreme case: k = n)
# Only practical for small datasets!
loo = LeaveOneOut()
scores_loo = cross_val_score(model, X, y, cv=loo)
print(f"\nLeave-One-Out CV: {scores_loo.mean():.4f} (n={len(scores_loo)} folds)")

# Confidence interval (approximate 95% CI)
mean = scores_kfold.mean()
std = scores_kfold.std()
ci_lower = mean - 1.96 * std / np.sqrt(len(scores_kfold))
ci_upper = mean + 1.96 * std / np.sqrt(len(scores_kfold))
print(f"\n95% CI for 5-Fold CV: [{ci_lower:.4f}, {ci_upper:.4f}]")
```

Cross-validation should be performed on your train+validation data only. Your test set remains held out. CV helps you choose the best model, but the final performance estimate still comes from the test set.
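The example above imports `TimeSeriesSplit` without using it. As a complement to the Time Series CV row in the table, here is a minimal sketch (a toy 12-point series, chosen only for illustration) of how its expanding-window folds keep training data strictly before validation data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy sequential data: 12 time-ordered observations
X_seq = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_seq), start=1):
    # Training window always precedes the validation window: no future leakage
    print(f"Fold {fold}: train on indices {train_idx.tolist()}, validate on {val_idx.tolist()}")
```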
How you split data matters as much as that you split it. Poor splitting strategies can introduce bias or fail to represent the true data distribution.
Random Splitting:
The default approach: randomly assign each sample to train/val/test. Works well when samples are independent and identically distributed, the dataset is reasonably large, and classes are not severely imbalanced.
Stratified Splitting:
Ensures that the class distribution (or target distribution) is preserved in each split. Essential when classes are imbalanced, when the dataset is small, or whenever a purely random split could leave a rare class underrepresented in one of the subsets.
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Create imbalanced dataset: 95% class 0, 5% class 1
np.random.seed(42)
n = 200
y = np.array([0] * 190 + [1] * 10)  # 95% vs 5%
X = np.random.randn(n, 5)

print("Original class distribution:", np.bincount(y))
print(f"Class 0: {np.mean(y==0)*100:.1f}%, Class 1: {np.mean(y==1)*100:.1f}%")

# Non-stratified split (risky!)
print("\n--- Random Split (non-stratified) ---")
for seed in [0, 1, 2, 3, 4]:
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed  # No stratify!
    )
    print(f"Seed {seed}: Test class 1 count = {sum(y_test==1)}/{len(y_test)}")
# Notice: Some splits might have 0 or very few class 1 samples in test!

# Stratified split (safe!)
print("\n--- Stratified Split ---")
for seed in [0, 1, 2, 3, 4]:
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y  # Stratified!
    )
    print(f"Seed {seed}: Test class 1 count = {sum(y_test==1)}/{len(y_test)}")
# Stratified: Consistent class proportions in every split
```

Special Splitting Strategies for Special Data:
Not all data can be randomly split. Some domains require specialized strategies:
- Temporal data: training data must precede validation/test data in time (expanding or sliding windows, as in Time Series CV above).
- Grouped data: all samples from the same patient, user, or site must stay in the same split (Group K-Fold above).
If correlated samples end up in both train and test, you're not evaluating generalization—you're evaluating memorization in disguise. For example, if multiple lab readings from the same patient appear in both sets, the model might learn patient-specific patterns rather than disease patterns.
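For the patient scenario just described, a group-aware splitter prevents this leakage. Here is a minimal sketch, assuming a made-up `patient_id` array with five readings per patient (all sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 100 lab readings from 20 patients (5 readings each)
rng = np.random.default_rng(42)
n_patients, readings_per_patient = 20, 5
X = rng.normal(size=(n_patients * readings_per_patient, 8))
y = rng.integers(0, 2, size=n_patients * readings_per_patient)
patient_id = np.repeat(np.arange(n_patients), readings_per_patient)

# GroupKFold guarantees that no patient appears in both train and validation folds
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=patient_id), start=1):
    overlap = set(patient_id[train_idx]) & set(patient_id[val_idx])
    print(f"Fold {fold}: {len(set(patient_id[val_idx]))} held-out patients, "
          f"patient overlap with training = {len(overlap)}")
```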
When you use cross-validation to select hyperparameters, you introduce a subtle bias: you're choosing hyperparameters that perform well on the validation folds. The reported CV score is optimistically biased.
Nested Cross-Validation solves this with two levels of CV:
```
For each outer fold:
    Set aside the outer fold as the test fold
    On the remaining data:
        Run inner CV to select the best hyperparameters
        Train a model with the best hyperparameters on all inner data
    Evaluate that model on the outer (test) fold
Average of the outer fold scores = unbiased performance estimate
```
This provides an unbiased estimate of optimized model performance—critical for scientific reporting.
```python
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1]
}

# WRONG: Non-nested CV (biased estimate)
# This finds best params and reports the same CV score
print("=== Non-Nested CV (BIASED) ===")
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print("⚠️ This score is optimistically biased!")

# CORRECT: Nested CV (unbiased estimate)
print("\n=== Nested CV (UNBIASED) ===")
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)  # 3 inner folds for speed

# Inner loop: hyperparameter tuning
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')

# Outer loop: performance estimation
nested_scores = cross_val_score(clf, X, y, cv=outer_cv, scoring='accuracy')

print(f"Nested CV scores: {nested_scores.round(4)}")
print(f"Mean: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
print("✓ This is an unbiased estimate of tuned model performance")

# The nested score is typically lower (less optimistic) than non-nested
print(f"\nDifference: {grid_search.best_score_ - nested_scores.mean():.4f}")
print("(Positive difference = non-nested was overly optimistic)")
```

Nested CV is computationally expensive (outer_k × inner_k × hyperparameter_combinations model fits). Use it when:
- Reporting results in scientific papers
- Comparing algorithms fairly
- The dataset is too small for a separate test set
- You need an unbiased estimate with hyperparameter tuning
With the theory established, let's consolidate practical recommendations for different scenarios:
| Dataset Size | Recommended Approach | Rationale |
|---|---|---|
| Very Small (<100) | Leave-One-Out CV or 5-fold + nested CV | Every sample matters; need robust estimation |
| Small (100-1,000) | Stratified 10-fold CV + separate test set | Balance between training data and estimation reliability |
| Medium (1,000-10,000) | 70/15/15 or 80/10/10 split with stratification | Enough data for reliable held-out estimates |
| Large (10,000-100,000) | 80/10/10 or 90/5/5 split | Test set of thousands gives tight confidence intervals |
| Very Large (>100,000) | Even 99/0.5/0.5 is often fine | 0.5% of 1M is 5,000 samples—plenty for evaluation |
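To make the rationale column concrete, here is a quick sketch of the approximate 95% confidence interval half-width for a measured test accuracy, using the normal approximation to the binomial (the 0.90 accuracy is just an assumed example):

```python
import numpy as np

accuracy = 0.90  # assumed observed test accuracy, for illustration
for n_test in [100, 1_000, 5_000, 50_000]:
    # Standard error of a proportion under the normal approximation
    se = np.sqrt(accuracy * (1 - accuracy) / n_test)
    half_width = 1.96 * se
    print(f"n_test = {n_test:>6}: accuracy = {accuracy:.2f} ± {half_width:.3f}")

# With 100 test samples the interval spans roughly ±0.06 (six points of accuracy);
# with 5,000 samples it shrinks to under ±0.01, which is why very large datasets
# can afford to hold out only a tiny fraction for testing.
```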
"I'll just check the test set once to see if I'm on the right track." This is the beginning of p-hacking your model. Each peek affects your future decisions. Each decision makes the test score more optimistic. By the time you 'officially' evaluate, you've already implicitly fit to the test set. Discipline is essential.
We've established the fundamental methodology for evaluating machine learning models. This isn't optional infrastructure—it's the scientific foundation that makes ML claims credible.
What's Next:
We've learned how to evaluate whether a model works. But what exactly is a model trying to do? The next page introduces the hypothesis space—the set of all possible functions a learning algorithm considers, and how this choice shapes learning.
You now understand the methodology of model evaluation: training, validation, and test sets; cross-validation; stratification; and nested CV. These aren't just techniques—they're the scientific method applied to machine learning. Next, we'll explore the hypothesis space.