Every machine learning project ultimately faces a fundamental question: Will this model work on new data?
A model that perfectly memorizes training examples but fails on fresh data is worthless. A model that makes predictions we can trust on unseen examples is valuable. The difference between these outcomes is generalization—the ability to extend what's learned from observed data to unobserved data.
But here's the challenge: by definition, we cannot directly measure performance on truly unseen data because... we haven't seen it yet. This creates an epistemological puzzle: How can we estimate future performance using only past data?
The answer to this puzzle—one of the most important methodological insights in machine learning—is the disciplined separation of data into training, validation, and test sets. This seemingly simple idea has profound implications for how we build, evaluate, and deploy ML models.
Improper data splits lead to overly optimistic performance estimates, models that fail in production, and wasted resources chasing phantom improvements. Perhaps worse, they erode trust in machine learning itself. Understanding data splits isn't optional—it's foundational to credible ML practice.
To understand why data splits matter, we must first understand what happens without them.
The Overfitting Scenario:
Imagine training a model on 1,000 labeled examples and then evaluating its accuracy on... those same 1,000 examples. What could go wrong?
Everything.
A sufficiently complex model can achieve perfect accuracy on training data by simply memorizing every example. Consider a decision tree with no depth limit: it can create a leaf node for every training point, perfectly classifying each one. A k-nearest neighbors model with k=1 will classify every training point correctly (its nearest neighbor is itself). A high-degree polynomial can pass through every data point exactly.
This perfect training performance tells us nothing about how the model will behave on new data. In fact, these memorized models typically perform terribly on new data because they've learned the noise and idiosyncrasies of the training set rather than the underlying patterns.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic data: y = sin(2x) + noise
np.random.seed(42)
n_samples = 20
X = np.linspace(0, 3, n_samples).reshape(-1, 1)
y_true = np.sin(2 * X).ravel()
y = y_true + 0.1 * np.random.randn(n_samples)  # Add noise

# Fit models of increasing complexity
degrees = [1, 3, 15]
X_plot = np.linspace(0, 3, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, degree in zip(axes, degrees):
    # Create polynomial features
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    X_plot_poly = poly.transform(X_plot)

    # Fit model
    model = LinearRegression()
    model.fit(X_poly, y)

    # Evaluate on training data (THIS IS THE MISTAKE)
    y_pred_train = model.predict(X_poly)
    train_mse = mean_squared_error(y, y_pred_train)

    # Plot
    ax.scatter(X, y, color='blue', s=50, label='Training data')
    ax.plot(X_plot, model.predict(X_plot_poly), 'r-', linewidth=2, label='Model')
    ax.set_title(f'Degree {degree}\nTrain MSE: {train_mse:.4f}')
    ax.legend()

plt.tight_layout()
plt.suptitle('Lower Training Error ≠ Better Model!', y=1.02, fontsize=14)
plt.show()

# The degree-15 polynomial has near-zero training error
# But it will perform terribly on new data!
print("Lesson: Training error is a BIASED estimate of true performance")
print("The more complex the model, the more optimistic (and misleading) training error becomes")
```

Training error is a biased and overly optimistic estimate of true performance. To get an honest estimate, we MUST evaluate on data the model has never seen during training. This is not optional—it's mathematically necessary.
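The memorizers mentioned above behave the same way: a 1-nearest-neighbor classifier or an unlimited-depth decision tree reaches perfect training accuracy while doing much worse on data it has not seen. Here is a minimal sketch, using a synthetic dataset and a simple held-out split (which the next part formalizes); the dataset sizes and noise level are chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic, somewhat noisy classification problem (sizes are arbitrary)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

memorizers = {
    '1-NN (nearest neighbor is itself)': KNeighborsClassifier(n_neighbors=1),
    'Unlimited-depth decision tree': DecisionTreeClassifier(max_depth=None, random_state=0),
}
for name, model in memorizers.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
# Both models score 1.00 on the training data they memorized,
# but noticeably lower on the held-out half of the data.
```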
The simplest solution to the memorization problem is the holdout method: split your data into two disjoint sets.
Training Set: the data the learning algorithm uses to fit the model's parameters.
Test Set: data held out entirely from training and used only to estimate performance on unseen examples.
Formally, given dataset $\mathcal{D}$ with $n$ examples:
$$\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{test}, \quad \mathcal{D}_{train} \cap \mathcal{D}_{test} = \emptyset$$
Common split ratios: 80/20 and 70/30 (train/test) are typical defaults; the larger the dataset, the smaller the fraction you need to hold out.
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load data
X, y = load_iris(return_X_y=True)
print(f"Total samples: {len(X)}")

# Split data: 80% train, 20% test
# random_state ensures reproducibility
# stratify=y ensures class balance is preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # Reproducibility
    stratify=y          # Maintain class proportions
)

print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

# Verify stratification
print(f"\nClass distribution in full data: {np.bincount(y)}")
print(f"Class distribution in training: {np.bincount(y_train)}")
print(f"Class distribution in test: {np.bincount(y_test)}")

# Train model ONLY on training data
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate on BOTH sets to see the difference
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"\nTraining accuracy: {train_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Gap: {train_acc - test_acc:.4f} (smaller is better)")

# If gap is large → model is overfitting
# If both are low → model is underfitting
```

The test set must remain untouched until final evaluation. If you use test set performance to make ANY decisions—algorithm choice, hyperparameter tuning, feature selection—you corrupt its validity. The test set becomes part of your 'training' process, and its error estimate becomes biased.
The train-test split has a critical limitation: we can't use the test set to make decisions.
But in practice, we need to make many decisions: which algorithm to use, which hyperparameters to set, which features to include, and when to stop training.
If we use the test set to answer these questions, we're implicitly fitting to the test set, and our performance estimate becomes optimistically biased.
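A quick simulation shows how strong this bias can be. In this hedged sketch (all sizes are made up for illustration), the labels are pure coin flips, so no model can genuinely exceed 50% accuracy; yet selecting the best of many zero-skill candidates by their score on a single fixed held-out set produces a score that looks comfortably better than chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heldout = 200        # size of the held-out set (assumed for illustration)
n_candidates = 50      # number of models/decisions we "try" against it

y_heldout = rng.integers(0, 2, size=n_heldout)  # random ground truth: pure noise
best_score = 0.0
for _ in range(n_candidates):
    # A candidate with zero real skill: random predictions
    random_predictions = rng.integers(0, 2, size=n_heldout)
    score = (random_predictions == y_heldout).mean()
    best_score = max(best_score, score)

print("True achievable accuracy: 0.50")
print(f"Best 'held-out' accuracy after {n_candidates} peeks: {best_score:.3f}")
# The winning score beats chance purely through selection.
# The same effect inflates your test score whenever test results guide decisions.
```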
The Solution: A Three-Way Split
We introduce a third subset—the validation set (also called development set or dev set):
$$\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{val} \cup \mathcal{D}_{test}$$
All three sets must be mutually exclusive: $$\mathcal{D}_{train} \cap \mathcal{D}_{val} = \emptyset, \quad \mathcal{D}_{train} \cap \mathcal{D}_{test} = \emptyset, \quad \mathcal{D}_{val} \cap \mathcal{D}_{test} = \emptyset$$
| Set | Typical Size | Purpose | How Often Used | Who 'Sees' It |
|---|---|---|---|---|
| Training | 60-80% | Learn model parameters | Many times (each epoch) | The learning algorithm |
| Validation | 10-20% | Tune hyperparameters, select models, early stopping | Many times (after each experiment) | The engineer (indirectly) |
| Test | 10-20% | Final, unbiased performance estimate | Once (at the very end) | No one until final evaluation |
The Workflow:
1. Train candidate models on the training set only.
2. Evaluate each candidate on the validation set; use these scores for model selection, hyperparameter tuning, and early stopping.
3. Pick the best candidate and, optionally, retrain it on the combined training + validation data.
4. Evaluate the final model exactly once on the test set.
This workflow ensures that your final performance estimate comes from data that was never involved in any decision-making process.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# First split: separate test set (final evaluation only)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: separate validation set from training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# 0.25 of 0.8 = 0.2, so we get a 60/20/20 split

print(f"Training set: {len(X_train)} samples (60%)")
print(f"Validation set: {len(X_val)} samples (20%)")
print(f"Test set: {len(X_test)} samples (20%)")

# Step 1: Train multiple models on TRAINING data only
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42)
}

# Step 2: Evaluate on VALIDATION set to choose best model
print("\n--- Model Selection (using validation set) ---")
val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_pred = model.predict(X_val)
    val_scores[name] = accuracy_score(y_val, val_pred)
    print(f"{name}: Validation Accuracy = {val_scores[name]:.4f}")

# Step 3: Select best model based on validation performance
best_model_name = max(val_scores, key=val_scores.get)
print(f"\nBest model: {best_model_name}")

# Step 4: Retrain on train + validation combined
X_train_full = np.vstack([X_train, X_val])
y_train_full = np.concatenate([y_train, y_val])

final_model = models[best_model_name].__class__(**models[best_model_name].get_params())
final_model.fit(X_train_full, y_train_full)

# Step 5: Final evaluation on TEST set (only once!)
test_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)

print(f"\n--- Final Evaluation (TEST SET - use only once!) ---")
print(f"Test Accuracy: {test_accuracy:.4f}")
print("This is our best estimate of real-world performance.")
```

Once you've selected your best model using the validation set, that validation data has served its purpose. You can now merge it with training data to train the final model on 80% of your data instead of 60%, potentially improving performance. The test set remains pristine for final evaluation.
The holdout approach has a significant limitation: the validation estimate is based on a single random split. If you're unlucky, an 'easy' subset ends up in validation, giving optimistic estimates—or a 'hard' subset gives pessimistic estimates.
Cross-Validation (CV) addresses this by systematically using all data for both training and validation.
K-Fold Cross-Validation: split the data into k folds of roughly equal size. For each of k rounds, train on k−1 folds and validate on the remaining fold, rotating which fold is held out; the k validation scores are then averaged.
This gives a more robust estimate because every data point is used for validation exactly once, and the estimate is less dependent on any single random split.
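To make the rotation explicit, here is a minimal sketch of the loop that `cross_val_score` performs for you, reusing the same iris data and logistic regression that appear in the cross-validation example below:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    # Each round: fit on k-1 folds, validate on the held-out fold
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[val_idx], model.predict(X[val_idx]))
    fold_scores.append(score)
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} val, accuracy = {score:.4f}")

print(f"Mean CV accuracy: {np.mean(fold_scores):.4f}")
```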
| Method | Description | When to Use | Pros | Cons |
|---|---|---|---|---|
| K-Fold CV | Split into k folds, rotate validation | General purpose, k=5 or 10 | Robust, efficient | Moderate computation |
| Stratified K-Fold | K-Fold preserving class proportions | Classification with imbalance | Better for classification | Slightly more complex |
| Leave-One-Out (LOO) | K-Fold where k=n | Very small datasets | Maximum training data | High variance, slow |
| Repeated K-Fold | K-Fold repeated multiple times | When estimate variance matters | Lower variance estimate | More computation |
| Time Series CV | Expanding/sliding window | Sequential/temporal data | Respects time ordering | Less training data early on |
| Group K-Fold | Keep groups together | When samples aren't independent | Prevents data leakage | Requires group labels |
```python
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold, LeaveOneOut, TimeSeriesSplit
)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Standard K-Fold (k=5)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = cross_val_score(model, X, y, cv=kfold)
print(f"5-Fold CV: {scores_kfold.mean():.4f} ± {scores_kfold.std():.4f}")
print(f"  Individual fold scores: {scores_kfold.round(4)}")

# Stratified K-Fold (maintains class proportions in each fold)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(model, X, y, cv=skfold)
print(f"\nStratified 5-Fold CV: {scores_strat.mean():.4f} ± {scores_strat.std():.4f}")

# 10-Fold (more folds = more training data per fold, but more variance)
scores_10fold = cross_val_score(model, X, y, cv=10)
print(f"\n10-Fold CV: {scores_10fold.mean():.4f} ± {scores_10fold.std():.4f}")

# Leave-One-Out (extreme case: k = n)
# Only practical for small datasets!
loo = LeaveOneOut()
scores_loo = cross_val_score(model, X, y, cv=loo)
print(f"\nLeave-One-Out CV: {scores_loo.mean():.4f} (n={len(scores_loo)} folds)")

# Confidence interval (approximate 95% CI)
mean = scores_kfold.mean()
std = scores_kfold.std()
ci_lower = mean - 1.96 * std / np.sqrt(len(scores_kfold))
ci_upper = mean + 1.96 * std / np.sqrt(len(scores_kfold))
print(f"\n95% CI for 5-Fold CV: [{ci_lower:.4f}, {ci_upper:.4f}]")
```

Cross-validation should be performed on your train+validation data only. Your test set remains held out. CV helps you choose the best model, but the final performance estimate still comes from the test set.
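The example above imports `TimeSeriesSplit` without using it. As a complement to the Time Series CV row in the table, here is a minimal sketch (a toy 12-point series, chosen only for illustration) of how its expanding-window folds keep training data strictly before validation data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy sequential data: 12 time-ordered observations
X_seq = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_seq), start=1):
    # Training window always precedes the validation window: no future leakage
    print(f"Fold {fold}: train on indices {train_idx.tolist()}, validate on {val_idx.tolist()}")
```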
How you split data matters as much as that you split it. Poor splitting strategies can introduce bias or fail to represent the true data distribution.
Random Splitting:
The default approach: randomly assign each sample to train/val/test. Works well when samples are independent and identically distributed, the dataset is reasonably large, and classes are not severely imbalanced.
Stratified Splitting:
Ensures that the class distribution (or target distribution) is preserved in each split. Essential when classes are imbalanced, when the dataset is small, or whenever a purely random split could leave a rare class underrepresented in one of the subsets.
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Create imbalanced dataset: 95% class 0, 5% class 1
np.random.seed(42)
n = 200
y = np.array([0] * 190 + [1] * 10)  # 95% vs 5%
X = np.random.randn(n, 5)

print("Original class distribution:", np.bincount(y))
print(f"Class 0: {np.mean(y==0)*100:.1f}%, Class 1: {np.mean(y==1)*100:.1f}%")

# Non-stratified split (risky!)
print("\n--- Random Split (non-stratified) ---")
for seed in [0, 1, 2, 3, 4]:
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed  # No stratify!
    )
    print(f"Seed {seed}: Test class 1 count = {sum(y_test==1)}/{len(y_test)}")
# Notice: Some splits might have 0 or very few class 1 samples in test!

# Stratified split (safe!)
print("\n--- Stratified Split ---")
for seed in [0, 1, 2, 3, 4]:
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y  # Stratified!
    )
    print(f"Seed {seed}: Test class 1 count = {sum(y_test==1)}/{len(y_test)}")
# Stratified: Consistent class proportions in every split
```

Special Splitting Strategies for Special Data:
Not all data can be randomly split. Some domains require specialized strategies:
- Temporal data: training data must precede validation/test data in time (expanding or sliding windows, as in Time Series CV above).
- Grouped data: all samples from the same patient, user, or site must stay in the same split (Group K-Fold above).
If correlated samples end up in both train and test, you're not evaluating generalization—you're evaluating memorization in disguise. For example, if multiple lab readings from the same patient appear in both sets, the model might learn patient-specific patterns rather than disease patterns.
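For the patient scenario just described, a group-aware splitter prevents this leakage. Here is a minimal sketch, assuming a made-up `patient_id` array with five readings per patient (all sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 100 lab readings from 20 patients (5 readings each)
rng = np.random.default_rng(42)
n_patients, readings_per_patient = 20, 5
X = rng.normal(size=(n_patients * readings_per_patient, 8))
y = rng.integers(0, 2, size=n_patients * readings_per_patient)
patient_id = np.repeat(np.arange(n_patients), readings_per_patient)

# GroupKFold guarantees that no patient appears in both train and validation folds
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=patient_id), start=1):
    overlap = set(patient_id[train_idx]) & set(patient_id[val_idx])
    print(f"Fold {fold}: {len(set(patient_id[val_idx]))} held-out patients, "
          f"patient overlap with training = {len(overlap)}")
```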
When you use cross-validation to select hyperparameters, you introduce a subtle bias: you're choosing hyperparameters that perform well on the validation folds. The reported CV score is optimistically biased.
Nested Cross-Validation solves this with two levels of CV:
```
For each outer fold:
    Set aside the outer fold as the test fold
    On the remaining data:
        Run inner CV to select the best hyperparameters
        Train a model with the best hyperparameters on all inner data
    Evaluate that model on the outer (test) fold
Average of the outer fold scores = unbiased performance estimate
```
This provides an unbiased estimate of optimized model performance—critical for scientific reporting.
```python
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1]
}

# WRONG: Non-nested CV (biased estimate)
# This finds best params and reports the same CV score
print("=== Non-Nested CV (BIASED) ===")
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print("⚠️ This score is optimistically biased!")

# CORRECT: Nested CV (unbiased estimate)
print("\n=== Nested CV (UNBIASED) ===")
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)  # 3 inner folds for speed

# Inner loop: hyperparameter tuning
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')

# Outer loop: performance estimation
nested_scores = cross_val_score(clf, X, y, cv=outer_cv, scoring='accuracy')

print(f"Nested CV scores: {nested_scores.round(4)}")
print(f"Mean: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
print("✓ This is an unbiased estimate of tuned model performance")

# The nested score is typically lower (less optimistic) than non-nested
print(f"\nDifference: {grid_search.best_score_ - nested_scores.mean():.4f}")
print("(Positive difference = non-nested was overly optimistic)")
```

Nested CV is computationally expensive (outer_k × inner_k × hyperparameter_combinations model fits). Use it when:
- Reporting results in scientific papers
- Comparing algorithms fairly
- The dataset is too small for a separate test set
- You need an unbiased estimate with hyperparameter tuning
With the theory established, let's consolidate practical recommendations for different scenarios:
| Dataset Size | Recommended Approach | Rationale |
|---|---|---|
| Very Small (<100) | Leave-One-Out CV or 5-fold + nested CV | Every sample matters; need robust estimation |
| Small (100-1,000) | Stratified 10-fold CV + separate test set | Balance between training data and estimation reliability |
| Medium (1,000-10,000) | 70/15/15 or 80/10/10 split with stratification | Enough data for reliable held-out estimates |
| Large (10,000-100,000) | 80/10/10 or 90/5/5 split | Test set of thousands gives tight confidence intervals |
| Very Large (>100,000) | Even 99/0.5/0.5 is often fine | 0.5% of 1M is 5,000 samples—plenty for evaluation |
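To make the rationale column concrete, here is a quick sketch of the approximate 95% confidence interval half-width for a measured test accuracy, using the normal approximation to the binomial (the 0.90 accuracy is just an assumed example):

```python
import numpy as np

accuracy = 0.90  # assumed observed test accuracy, for illustration
for n_test in [100, 1_000, 5_000, 50_000]:
    # Standard error of a proportion under the normal approximation
    se = np.sqrt(accuracy * (1 - accuracy) / n_test)
    half_width = 1.96 * se
    print(f"n_test = {n_test:>6}: accuracy = {accuracy:.2f} ± {half_width:.3f}")

# With 100 test samples the interval spans roughly ±0.06 (six points of accuracy);
# with 5,000 samples it shrinks to under ±0.01, which is why very large datasets
# can afford to hold out only a tiny fraction for testing.
```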
"I'll just check the test set once to see if I'm on the right track." This is the beginning of p-hacking your model. Each peek affects your future decisions. Each decision makes the test score more optimistic. By the time you 'officially' evaluate, you've already implicitly fit to the test set. Discipline is essential.
We've established the fundamental methodology for evaluating machine learning models. This isn't optional infrastructure—it's the scientific foundation that makes ML claims credible.
What's Next:
We've learned how to evaluate whether a model works. But what exactly is a model trying to do? The next page introduces the hypothesis space—the set of all possible functions a learning algorithm considers, and how this choice shapes learning.
You now understand the methodology of model evaluation: training, validation, and test sets; cross-validation; stratification; and nested CV. These aren't just techniques—they're the scientific method applied to machine learning. Next, we'll explore the hypothesis space.