With a well-formulated problem, clean data, and thoughtfully engineered features, you arrive at what most people consider the 'ML part' of ML: model selection and training. This is where algorithms transform data into predictive power.
But model selection isn't about finding the 'best' algorithm—there is no universal best. It's about finding the algorithm that's best suited to your specific problem, data, and constraints. A simple logistic regression might outperform a complex neural network when data is limited. A gradient boosting model might dominate on tabular data but fail on images. A model with 99% accuracy might be useless if it requires 10 seconds per prediction in a 10ms latency environment.
The goal is not to use the most sophisticated algorithm, but to achieve the best outcome given your constraints.
By the end of this page, you will understand how to navigate the landscape of ML algorithms, match algorithms to problem types and data characteristics, configure hyperparameters effectively, train models to convergence, handle overfitting and underfitting, and build robust training pipelines.
The universe of ML algorithms is vast, but most fall into recognizable families with shared characteristics. Understanding these families helps you navigate toward appropriate candidates for your problem.
| Family | Examples | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Linear Models | Linear/Logistic Regression, Ridge, Lasso, ElasticNet | Fast, interpretable, good baseline | Can't capture nonlinearity without feature engineering | Wide data (many features, few samples), need interpretability |
| Tree-Based | Decision Trees, Random Forest, XGBoost, LightGBM, CatBoost | Handle nonlinearity, feature interactions, robust | Can overfit, less interpretable than linear | Tabular data, competitions, most supervised problems |
| Support Vector Machines | SVM, SVR, One-class SVM | Effective in high dimensions, kernel flexibility | Slow with large datasets, need feature scaling | Small-medium datasets, text classification |
| Neighbors | KNN, Radius Neighbors | Simple, no training, adapts to any distribution | Slow inference, memory-intensive, curse of dimensionality | Small datasets, anomaly detection, prototyping |
| Naive Bayes | Gaussian, Multinomial, Bernoulli NB | Very fast, works with little data | Assumes feature independence (often wrong) | Text classification, spam filtering, real-time |
| Neural Networks | MLP, CNN, RNN, Transformers | Learn complex patterns, handle raw inputs | Need lots of data, slow to train, less interpretable | Images, text, speech, complex patterns, big data |
| Ensembles | Bagging, Boosting, Stacking | Combine multiple models for robustness | More complex, slower | When marginal improvements matter |
The No Free Lunch Theorem:
A fundamental result in ML: no algorithm dominates across all problems. An algorithm that excels on one type of data may fail on another. This is why benchmarking on your specific data is essential.
That said, some practical patterns emerge:
| Data Type | Typical Winner |
|---|---|
| Tabular data (structured) | Gradient boosting (XGBoost, LightGBM, CatBoost) |
| Images | Convolutional Neural Networks |
| Text (classification) | Transformers (BERT, etc.) or gradient boosting on TF-IDF |
| Time series | ARIMA, Prophet, LSTM, or gradient boosting with lag features |
| Small datasets (<1000 samples) | Linear models, SVMs, ensembles of simple models |
| Need interpretability | Linear models, decision trees, rule-based |
| Real-time inference | Light models, quantized networks, linear |
Always start with a simple baseline. Logistic regression for classification, linear regression for regression. This baseline tells you how much room for improvement exists and ensures you're not wasting effort on complex models when simple ones suffice.
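Such a baseline takes only a few lines with scikit-learn. A minimal sketch — the synthetic dataset below is a stand-in for your own `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your dataset -- replace with your own X, y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# A linear baseline: cheap to fit, and its score anchors every later comparison
baseline = LogisticRegression(max_iter=1000)
baseline_auc = cross_val_score(baseline, X, y, cv=5, scoring='roc_auc').mean()
print(f"Baseline AUC: {baseline_auc:.3f}")
```

Any more complex model you try later has to beat this number by enough to justify its extra cost.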
Algorithm selection depends on multiple factors. Here's a framework for narrowing your choices:
Factor 1: Task Type
| Task | Candidate Algorithms |
|---|---|
| Binary classification | Logistic Regression, Random Forest, XGBoost, SVM, Neural Networks |
| Multi-class classification | Same as binary, adapted via softmax outputs or one-vs-rest schemes |
| Regression | Linear Regression, Ridge, Lasso, Random Forest, XGBoost, Neural Networks |
| Ranking | Learning to Rank (LambdaMART, RankNet), Gradient Boosting |
| Clustering | K-Means, DBSCAN, Hierarchical, Gaussian Mixture |
| Anomaly detection | Isolation Forest, One-class SVM, Autoencoders |
Factor 2: Data Characteristics
Factor 3: Constraints
| Constraint | Algorithm Implications |
|---|---|
| Latency < 10ms | Simple models, optimized inference, avoid large ensembles |
| Interpretability required | Linear models, decision trees, rule-based systems |
| Must run on edge/mobile | Quantized neural networks, very small models |
| Limited training data | Strong regularization, simple models, transfer learning |
| Real-time retraining | Online learning algorithms (SGD variants, Hoeffding trees) |
Factor 4: Maintenance Overhead
Complex models require more maintenance: more dependencies to track, more retraining overhead, and more ways to fail silently in production. If a simpler model achieves 95% of the performance with 10% of the maintenance burden, it's often the right choice.
For most tabular problems today, the default is gradient boosting (XGBoost, LightGBM, or CatBoost). For text and images, transformer-based models (or fine-tuned versions) dominate. This reflects years of community benchmarking—these defaults are strong starting points.
Every algorithm has hyperparameters—settings that control learning behavior but aren't learned from data. Choosing good hyperparameters can dramatically affect performance.
Common Hyperparameters by Algorithm:
| Algorithm | Key Hyperparameters | Typical Impact |
|---|---|---|
| Linear Models | Regularization strength (C, alpha), regularization type (L1/L2) | Controls overfitting, sparsity |
| Random Forest | n_estimators, max_depth, min_samples_split, max_features | Variance-bias tradeoff, training time |
| XGBoost/LightGBM | n_estimators, max_depth, learning_rate, subsample, reg_alpha/lambda | Overfitting, training time, performance |
| SVM | C (regularization), kernel, gamma (for RBF) | Margin width, flexibility |
| Neural Networks | Layer sizes, learning rate, batch size, dropout, weight decay | Capacity, generalization, training dynamics |
| KNN | k (neighbors), distance metric, weights | Smoothness, local vs global |
Hyperparameter Search Strategies:
Manual Tuning: Start with defaults, adjust based on validation results and intuition. Fast for initial exploration.
Grid Search: Exhaustively try every combination in a grid. Thorough but exponentially expensive.
Random Search: Randomly sample hyperparameter combinations. More efficient than grid search for high-dimensional spaces.
Bayesian Optimization: Use a probabilistic model to guide search toward promising regions. More efficient for expensive evaluations.
Successive Halving / Hyperband: Allocate resources adaptively—train many configurations briefly, continue only the promising ones.
Automated ML (AutoML): Let the tool search (Optuna, Ray Tune, Auto-sklearn). Good for final optimization.
```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import optuna

# ===== GRID SEARCH =====
# Exhaustive but expensive
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

# ===== RANDOM SEARCH =====
# More efficient for many hyperparameters
param_distributions = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # Number of random combinations to try
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train, y_train)

# ===== BAYESIAN OPTIMIZATION WITH OPTUNA =====
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    }
    model = RandomForestClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, show_progress_bar=True)

print(f"Best trial: {study.best_trial.params}")
print(f"Best score: {study.best_trial.value:.4f}")
```

Hyperparameter tuning uses the validation set to make decisions. This means the validation performance becomes optimistic—you've selected hyperparameters that happen to work well on that particular set. Always reserve a final test set that's never seen during tuning for unbiased evaluation.
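The successive halving strategy mentioned above is available in scikit-learn as `HalvingRandomSearchCV` (still behind an experimental import flag in recent versions). A minimal sketch, again with synthetic data standing in for your own:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 -- enables the class below
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for your training data
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=42)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
}

# Successive halving: many configurations get a small budget;
# only the best-performing ones survive to the next, larger budget
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    factor=3,              # keep roughly the top 1/3 of candidates each round
    resource='n_samples',  # the growing budget is the number of training rows
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

This spends most of the compute on configurations that already look promising, which is why it scales better than grid search.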
Training is the process of learning model parameters from data. Different algorithms have different training dynamics, but common principles apply.
Training Pipeline Components:
Key Training Concepts:
Monitoring Training:
During training, monitor both training and validation metrics. The typical progression: both errors fall together early on; validation error then flattens and eventually rises while training error keeps falling, which is the signal to stop.
Learning Rate Schedules:
Learning rate often needs to change during training:
| Schedule | Description | When to Use |
|---|---|---|
| Constant | Fixed learning rate | Simple problems, initial exploration |
| Step decay | Reduce by factor every N epochs | Standard practice for many problems |
| Cosine annealing | Smooth decay following cosine curve | Neural networks, longer training |
| Warmup + decay | Start low, increase, then decrease | Transformers, large models |
| Reduce on plateau | Decrease when validation stalls | Adaptive to training dynamics |
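The schedules in the table are easy to express as plain functions of the epoch number. Frameworks ship built-in equivalents, but a from-scratch sketch makes the shapes concrete (the rates and epoch counts below are illustrative, not recommendations):

```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))

def cosine_annealing(epoch, total_epochs=100, base_lr=0.1, min_lr=1e-5):
    """Smoothly decay from base_lr to min_lr along a cosine curve."""
    cos = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return min_lr + (base_lr - min_lr) * cos

def warmup_then_cosine(epoch, warmup=5, total_epochs=100, base_lr=0.1):
    """Linear warmup for a few epochs, then cosine decay (transformer-style)."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return cosine_annealing(epoch - warmup, total_epochs - warmup, base_lr)

print(step_decay(0), step_decay(10), step_decay(20))  # 0.1 0.05 0.025
```

"Reduce on plateau" is the odd one out: it is not a function of the epoch alone, since it reacts to the validation metric (as the `ReduceLROnPlateau` callback does in the Keras example below).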
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import xgboost as xgb

# ===== BASIC TRAINING WITH VALIDATION =====
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ===== XGBOOST WITH EARLY STOPPING =====
model = xgb.XGBClassifier(
    n_estimators=1000,  # High number - early stopping will prevent overfit
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='auc',
    early_stopping_rounds=50,  # Stop if no improvement for 50 rounds
    # (in xgboost >= 2.0 this is a constructor argument, not a fit() argument)
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)

print(f"Best iteration: {model.best_iteration}")
print(f"Best score: {model.best_score:.4f}")

# ===== NEURAL NETWORK WITH CALLBACKS =====
import tensorflow as tf
from tensorflow.keras import layers, callbacks

# Build model
model = tf.keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['AUC']
)

# Callbacks for training control
training_callbacks = [
    callbacks.EarlyStopping(
        monitor='val_auc',
        patience=10,
        mode='max',
        restore_best_weights=True
    ),
    callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6
    ),
    callbacks.ModelCheckpoint(
        'best_model.keras',
        monitor='val_auc',
        save_best_only=True,
        mode='max'
    )
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    batch_size=32,
    callbacks=training_callbacks,
    verbose=1
)
```

Set random seeds everywhere: Python's random, NumPy, and framework-specific seeds. Without reproducibility, you can't debug effectively or confidently compare experiments.
The central challenge of machine learning is generalization—learning patterns that apply beyond the training data. Overfitting and underfitting are failures of generalization.
Underfitting: The model is too simple to capture the underlying pattern. Both training and validation performance are poor.
Overfitting: The model memorizes training data, including noise. Training performance is excellent; validation performance is poor.
Diagnosing the Condition:
| Symptom | Diagnosis | Root Cause | Remedies |
|---|---|---|---|
| High training error, high validation error | Underfitting | Model too simple, features not informative | More complex model, better features, longer training |
| Low training error, high validation error | Overfitting | Model too complex, not enough data, training too long | Regularization, more data, early stopping, simpler model |
| Low training error, slightly higher validation error | Healthy | Good generalization | Fine-tune, but avoid over-optimizing |
| Training/validation error decreasing together | Still learning | Model not converged | Continue training, monitor for divergence |
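The table's diagnostics can be read directly off a learning curve. scikit-learn's `learning_curve` computes train and validation scores at increasing training-set sizes; a persistent train-validation gap signals overfitting, while two low, flat curves signal underfitting. A sketch on synthetic placeholder data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training data
    cv=5, scoring='accuracy',
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Large positive gap -> overfitting; both scores low and flat -> underfitting
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:+.3f}")
```

If the validation curve is still rising at the largest size, collecting more data is likely to help; if it has plateaued, look to the model or features instead.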
Regularization Techniques:
Regularization is any technique that prevents overfitting by constraining the model:
| Technique | How It Works | Best For |
|---|---|---|
| L1 (Lasso) | Penalizes sum of absolute weights | Feature selection, sparse models |
| L2 (Ridge) | Penalizes sum of squared weights | Reducing weight magnitudes, preventing exploding weights |
| Dropout | Randomly zero activations during training | Neural networks |
| Early Stopping | Stop training before overfitting | All iterative models |
| Data Augmentation | Create synthetic training examples | Images, text, limited data |
| Cross-Validation | Train on multiple subsets | Robustness, model selection |
| Ensemble Averaging | Average multiple models' predictions | Reducing variance |
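A quick way to see the L1-versus-L2 difference from the table: fit `Lasso` and `Ridge` on data where most features are pure noise and count exactly-zero coefficients. The dataset and `alpha` values below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, but only 5 carry signal -- L1 should zero out most of the rest
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"Lasso zeroed {lasso_zeros}/50 coefficients; Ridge zeroed {ridge_zeros}/50")
```

The L1 penalty drives uninformative weights exactly to zero (built-in feature selection), while the L2 penalty only shrinks them toward zero, which is why the table pairs L1 with sparsity and L2 with controlling weight magnitudes.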
Underfitting is high bias (model's assumptions are too strong). Overfitting is high variance (model is too sensitive to training data). The optimal model balances both—complex enough to capture the signal, constrained enough not to memorize the noise.
Cross-validation (CV) provides more reliable model evaluation than a single train/validation split by using all data for both training and validation across multiple folds.
Why Cross-Validation?
| Strategy | Description | When to Use |
|---|---|---|
| K-Fold CV | Split data into K folds, train on K-1, validate on 1, repeat K times | Standard approach, general purpose |
| Stratified K-Fold | K-Fold maintaining class proportions in each fold | Classification, imbalanced data |
| Group K-Fold | Ensure all samples from a group are in the same fold | Per-user predictions, medical data (by patient) |
| Time Series Split | Training folds always precede validation folds temporally | Time series, sequential data |
| Leave-One-Out (LOO) | Each sample is validation set once | Very small datasets only (expensive) |
| Repeated K-Fold | Run K-Fold multiple times with different random splits | More robust estimates, expensive |
| Nested CV | Outer CV for evaluation, inner CV for hyperparameter tuning | Unbiased evaluation of tuned models |
```python
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold,
    GroupKFold, TimeSeriesSplit, cross_validate
)
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# ===== BASIC K-FOLD =====
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")

# ===== STRATIFIED K-FOLD (for classification) =====
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')

# ===== GROUP K-FOLD (preserve groups) =====
# groups = user_ids or patient_ids
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, groups=groups, scoring='roc_auc')

# ===== TIME SERIES SPLIT =====
# For sequential data - no shuffling, respect time order
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')

# ===== GET MULTIPLE METRICS =====
scoring = ['accuracy', 'roc_auc', 'f1', 'precision', 'recall']
results = cross_validate(model, X, y, cv=5, scoring=scoring,
                         return_train_score=True)

for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric}: train={results[train_key].mean():.4f}, "
          f"val={results[test_key].mean():.4f}")

# ===== NESTED CV (unbiased evaluation of tuned model) =====
from sklearn.model_selection import GridSearchCV

# Inner CV for hyperparameter tuning
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=inner_cv,
    scoring='roc_auc'
)

# Outer CV for unbiased evaluation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='roc_auc')

print(f"Nested CV AUC: {nested_scores.mean():.4f} (+/- {nested_scores.std():.4f})")
```

If you use CV for hyperparameter tuning, the resulting 'best' score is optimistic. Nested CV uses an outer loop for evaluation and an inner loop for tuning, providing unbiased performance estimates of tuned models. Use nested CV when reporting final results or comparing different algorithms.
Ensembles combine multiple models to produce better predictions than any single model. They work because different models make different errors; by combining them, errors cancel out.
Why Ensembles Work:
Ensemble Types:
| Method | How It Works | Key Idea | Examples |
|---|---|---|---|
| Bagging | Train models on bootstrap samples, average predictions | Reduce variance via averaging | Random Forest |
| Boosting | Train models sequentially, each focusing on previous errors | Reduce bias via sequential correction | XGBoost, LightGBM, AdaBoost |
| Stacking | Train a meta-model on base model predictions | Learn optimal combination weights | Stacked generalization |
| Voting | Combine predictions by voting (hard) or averaging (soft) | Simple combination | VotingClassifier |
| Blending | Like stacking but uses holdout set instead of CV | Simpler than stacking, risk of overfitting | Competition techniques |
Gradient Boosting Deep Dive:
Gradient boosting (XGBoost, LightGBM, CatBoost) is the dominant approach for tabular data. It works by:
Each iteration focuses on what previous models got wrong. The learning rate controls how much each new tree contributes.
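That residual-fitting loop can be sketched from scratch for squared-error regression in a few lines. Real libraries add regularization, second-order gradients, and clever split finding, but the core mechanism is just this (synthetic data, illustrative settings):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: noisy sine wave
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1
n_rounds = 100

# Start from the best constant prediction, then repeatedly fit the residuals
prediction = np.full_like(y, y.mean())
trees = []
for _ in range(n_rounds):
    residuals = y - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # small corrective step
    trees.append(tree)

mse_start = float(np.mean((y - y.mean()) ** 2))
mse_final = float(np.mean((y - prediction) ** 2))
print(f"MSE: {mse_start:.3f} -> {mse_final:.3f}")
```

Each tree is weak on its own (depth 2 here), but because every round targets the current residuals, the ensemble's training error drops steadily; the learning rate scales down each tree's contribution so no single round overcommits.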
When to Use Ensembles:
When Not to Use:
```python
import numpy as np
from sklearn.ensemble import (
    VotingClassifier, StackingClassifier,
    RandomForestClassifier, GradientBoostingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb
import lightgbm as lgb

# ===== VOTING ENSEMBLE =====
# Combine different model types
voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('xgb', xgb.XGBClassifier(n_estimators=100, random_state=42)),
        ('lgb', lgb.LGBMClassifier(n_estimators=100, random_state=42))
    ],
    voting='soft'  # 'soft' averages probabilities; 'hard' uses majority vote
)
voting_clf.fit(X_train, y_train)

# ===== STACKING ENSEMBLE =====
# Learn optimal combination via meta-model
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('xgb', xgb.XGBClassifier(n_estimators=100, random_state=42)),
        ('lgb', lgb.LGBMClassifier(n_estimators=100, random_state=42))
    ],
    final_estimator=LogisticRegression(),  # Meta-model
    cv=5,               # Use 5-fold CV to generate meta-features
    passthrough=False   # Whether to include original features
)
stacking_clf.fit(X_train, y_train)

# ===== MANUAL BLENDING =====
# Train base models on train, predict on holdout
from sklearn.model_selection import train_test_split

X_train_base, X_holdout, y_train_base, y_holdout = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Train base models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)

rf.fit(X_train_base, y_train_base)
xgb_model.fit(X_train_base, y_train_base)

# Get predictions on holdout
blend_train = np.column_stack([
    rf.predict_proba(X_holdout)[:, 1],
    xgb_model.predict_proba(X_holdout)[:, 1]
])

# Train meta-model on holdout predictions
meta_model = LogisticRegression()
meta_model.fit(blend_train, y_holdout)

# For final test predictions
blend_test = np.column_stack([
    rf.predict_proba(X_test)[:, 1],
    xgb_model.predict_proba(X_test)[:, 1]
])
final_predictions = meta_model.predict_proba(blend_test)[:, 1]
```

Ensembling five very similar models provides little benefit. The power comes from diversity—combining models with different architectures, trained on different features, or using different algorithms. Correlation between model errors should be low.
Model selection and training is the heart of ML development—where data becomes predictions. Success requires understanding the algorithm landscape, matching algorithms to problems, and carefully managing the training process.
What's Next:
A trained model is only valuable if it works in practice. The next page covers Evaluation and Deployment—how to rigorously assess model performance, catch failures before production, and successfully bring models from development into real-world operation.
You now understand model selection and training comprehensively: navigating the algorithm landscape, matching algorithms to problems, tuning hyperparameters, managing the training process, preventing overfitting, using cross-validation, and building ensembles. These skills are central to any ML practitioner's toolkit.