If you've spent any time in industry machine learning, you've heard the refrain: 80% of ML work is data work. This isn't hyperbole—it's conservative. Data collection and preparation is where projects succeed or fail, where timelines double or triple, and where the difference between academic exercises and production systems becomes painfully clear.
Consider the asymmetry: you can execute state-of-the-art algorithms with a few library calls, but cleaning a dataset requires understanding every quirk of the data generation process, every bug in the logging pipeline, every edge case in user behavior. Algorithms are well-documented; your company's data pipelines are not.
This page teaches you to approach data work systematically—not as a tedious prerequisite to the 'real' ML work, but as the foundation that determines your ceiling. A perfect model trained on flawed data is a flawed model. There are no shortcuts.
By the end of this page, you will understand strategies for data collection, common data quality issues and how to detect them, data cleaning techniques, handling missing values, addressing data imbalance, and constructing reliable train/validation/test splits. You will be equipped to transform raw, messy data into model-ready datasets.
Before you can clean data, you must collect it. The source and method of collection profoundly affect what's possible downstream. Different collection strategies have different costs, biases, and quality characteristics.
Primary Data Collection Sources:
| Source Type | Examples | Advantages | Challenges |
|---|---|---|---|
| Transactional Systems | Purchase logs, user actions, API calls | High volume, already structured, reflects real behavior | May miss context, logging gaps, privacy constraints |
| Sensors & IoT | Telemetry, GPS, industrial sensors | Continuous, objective measurements | Sensor drift, missing readings, calibration issues |
| User-Generated Content | Reviews, posts, uploads | Rich, diverse, captures opinions | Noisy, biased toward vocal users, spam |
| Surveys & Annotations | Crowdsourced labels, expert ratings | Direct answers to your questions | Expensive, slow, annotator biases |
| External Data | APIs, purchased datasets, open data | Augments internal data, new signals | Quality varies, integration complexity, licensing |
| Synthetic Data | Generated via simulation or models | Infinite volume, controllable | May not match real distribution, verification needed |
Active vs. Passive Collection:
Passive collection captures data as a byproduct of normal operations. User clicks are logged; transactions are recorded. It's cheap and scalable but limited to what naturally occurs.
Active collection involves deliberate effort to gather specific data: running surveys, creating labeling tasks, or designing experiments to capture edge cases. It's expensive but targeted.
Most ML projects require both: passive collection provides volume; active collection fills gaps and validates labels.
Designing Data Collection:
When you have influence over data collection (which happens more often than people realize), the single most important consideration is to err on the side of logging more:
If there's any chance a signal might be useful for ML, log it now. Adding logging takes days; waiting for enough historical data takes months. The cost of logging something you don't use is trivial compared to the cost of not having data you need.
Data quality isn't a single dimension—it's a multifaceted concept encompassing accuracy, completeness, consistency, and relevance. Poor data quality is insidious: it silently degrades model performance, creates false confidence in results, and generates predictions that look reasonable but are systematically wrong.
Dimensions of Data Quality:
As noted above, quality spans at least accuracy, completeness, consistency, and relevance; a dataset can look excellent on one dimension while failing badly on another.
Data Quality Issues and Their Sources:
Data quality problems don't appear randomly—they have systematic causes:
| Issue | Common Causes | Detection Strategies |
|---|---|---|
| Missing Values | Optional fields, data entry errors, logging failures, user drop-off | Count nulls per column; analyze missingness patterns |
| Incorrect Values | Data entry errors, sensor malfunction, currency/unit confusion | Range checks, cross-validation with other sources |
| Duplicate Records | System bugs, retry mechanisms, merge failures | Deduplication on unique identifiers; fuzzy matching |
| Inconsistent Formats | Multiple data sources, schema evolution | Schema validation, format parsing errors |
| Outliers | Real anomalies, measurement errors, data entry errors | Statistical tests, visualization, domain knowledge |
| Label Noise | Annotator errors, ambiguous cases, evolving definitions | Inter-annotator agreement; spot-checking |
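These detection strategies are easy to script. Below is a minimal sketch of a quick quality audit with pandas; the file name, column names, and the allowed 'age' range are hypothetical, so adapt the checks to your own schema.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> None:
    """Print a quick overview of common data quality issues."""
    # Missing values: count per column, largest first
    missing = df.isnull().sum()
    print("Missing values per column:")
    print(missing[missing > 0].sort_values(ascending=False))

    # Exact duplicate rows
    print(f"\nExact duplicate rows: {df.duplicated().sum()}")

    # Simple range check on a numeric column (hypothetical bounds)
    if 'age' in df.columns:
        bad_age = df[(df['age'] < 0) | (df['age'] > 120)]
        print(f"Rows with out-of-range age: {len(bad_age)}")

    # Basic distribution summary to eyeball outliers and unit problems
    print("\nNumeric summary:")
    print(df.describe().T[['min', 'mean', 'max']])

quality_report(pd.read_csv('data.csv'))  # hypothetical file
```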
No algorithm can overcome fundamentally flawed data. A model trained on data with 10% label noise will have at least 10% error—probably more. Invest in data quality before investing in model complexity.
Once you've identified data quality issues, you must address them. Data cleaning is both an art and a science—there are general principles, but specific decisions depend on domain knowledge and downstream use cases.
Handling Missing Values:
Missing data is ubiquitous. The appropriate treatment depends on why data is missing: values may be missing completely at random (MCAR), missing at random given other observed features (MAR), or missing not at random (MNAR), where the missingness depends on the unobserved value itself.
| Strategy | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Drop rows | Small fraction missing, MCAR | Simple, no bias from imputation | Loses data; biased if not MCAR |
| Drop columns | Column mostly missing or uninformative | Reduces dimensionality | Loses potentially useful signal |
| Mean/Median imputation | Numeric, MCAR, need simplicity | Fast, preserves sample size | Distorts variance, relationships |
| Mode imputation | Categorical, MCAR | Fast, preserves sample size | Distorts category frequencies |
| Model-based imputation | MAR, enough data to build imputer | Uses relationships between features | Complex, can propagate errors |
| Indicator variable | Missingness is informative | Preserves missingness signal | Adds feature; may overfit |
| Domain-specific defaults | Known default makes sense | Incorporates domain knowledge | Requires domain expertise |
Handling Outliers:
Outliers are observations that deviate significantly from the rest of the data. But deviation isn't inherently wrong: outliers can be genuine rare events (often the most interesting cases, such as fraudulent transactions) or artifacts of measurement and data entry errors.
Distinguishing these requires domain knowledge. Common strategies include removing values that are clearly erroneous, capping extremes at chosen percentiles (winsorization), applying transformations such as log that reduce their influence, and keeping them untouched when they are the phenomenon you care about.
Handling Duplicates:
Duplicates can be exact (identical rows, typically produced by retries, system bugs, or failed merges) or partial (the same underlying event recorded multiple times with small differences, which requires deduplicating on key columns or fuzzy matching).
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Load data
df = pd.read_csv('data.csv')

# ===== MISSING VALUE HANDLING =====

# Strategy 1: Drop rows with any missing values (use cautiously)
df_no_missing = df.dropna()

# Strategy 2: Drop columns with >50% missing
threshold = 0.5
cols_to_keep = df.columns[df.isnull().mean() < threshold]
df_filtered = df[cols_to_keep]

# Strategy 3: Mean imputation for numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy='mean')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Strategy 4: Mode imputation for categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Strategy 5: KNN imputation (uses similarity between samples)
# Shown as an alternative to Strategy 3, not a follow-up step
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])

# Strategy 6: Add missingness indicator
# In a real pipeline, compute indicators BEFORE imputing, or the flag is always 0
df['age_missing'] = df['age'].isnull().astype(int)

# ===== OUTLIER HANDLING =====

# Winsorization: clip to 1st and 99th percentile
for col in numeric_cols:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = df[col].clip(lower, upper)

# ===== DUPLICATE HANDLING =====

# Remove exact duplicates
df_deduped = df.drop_duplicates()

# Remove duplicates based on subset of columns (e.g., keep latest)
df_deduped = df.sort_values('timestamp').drop_duplicates(
    subset=['user_id', 'product_id'], keep='last'
)
```

Every data cleaning decision is a choice that affects downstream analysis. Document what you did and why. Future you (or your teammates) will need this context when results are questioned or when the process needs to be reproduced.
Beyond cleaning, data often needs transformation to be suitable for ML algorithms. Different algorithms have different assumptions about input data; violating these assumptions degrades performance.
Common Transformations:
| Transformation | Purpose | When to Use | Algorithms That Benefit |
|---|---|---|---|
| Scaling (Min-Max) | Scale features to [0, 1] range | When features have different scales | Neural networks, SVM, KNN |
| Standardization (Z-score) | Center to mean=0, std=1 | When features are normally distributed | Linear regression, PCA, neural networks |
| Log transformation | Reduce skewness, compress range | Right-skewed distributions (income, counts) | Linear models, improves normality |
| Box-Cox transformation | Generalized power transform for normality | When log isn't sufficient | Linear models requiring normality |
| One-hot encoding | Convert categories to binary columns | Nominal (unordered) categories | Most algorithms (except tree-based) |
| Ordinal encoding | Convert ordered categories to integers | Ordinal categories (low/medium/high) | Tree-based algorithms, models aware of order |
| Target encoding | Replace category with target mean | High-cardinality categories | Any algorithm; requires careful regularization |
| Binning/Discretization | Convert continuous to categorical | Non-linear relationships, simplification | Decision trees (benefit less), interpretability |
Numeric Feature Transformations in Depth:
Scaling is crucial for distance-based algorithms. If one feature ranges from 0-1000 and another from 0-1, the first dominates distance calculations. K-Nearest Neighbors, SVM, and neural networks all require scaled inputs.
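As a small illustration (with made-up feature values and ranges), the unscaled distance between two customers is dominated entirely by the income feature, while the scaled distance reflects both features:

```python
import numpy as np

# Hypothetical customers: (income in dollars, satisfaction rating in [0, 1])
a = np.array([52000.0, 0.1])
b = np.array([51000.0, 0.9])

# Unscaled Euclidean distance is dominated by the income difference
print(np.linalg.norm(a - b))      # ~1000.0

# Min-max scaling with assumed feature ranges (income 0-100000, rating 0-1)
mins = np.array([0.0, 0.0])
maxs = np.array([100000.0, 1.0])
a_s = (a - mins) / (maxs - mins)
b_s = (b - mins) / (maxs - mins)
print(np.linalg.norm(a_s - b_s))  # ~0.80: the rating difference now matters
```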
Log and power transformations address distributional issues. Many real-world quantities (income, transaction amounts, time durations) are heavily right-skewed. Log transformation makes relationships more linear and reduces the influence of extreme values:
original: [1, 10, 100, 1000, 10000]
log10: [0, 1, 2, 3, 4]
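A brief sketch of both the log and power transforms, assuming a right-skewed column named 'income' and using scikit-learn's PowerTransformer for Box-Cox:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed data
df = pd.DataFrame({'income': [20_000, 35_000, 48_000, 90_000, 1_200_000]})

# Log transform: log1p handles zeros safely
df['income_log'] = np.log1p(df['income'])

# Box-Cox requires strictly positive values; method='yeo-johnson' also allows zero/negative
pt = PowerTransformer(method='box-cox')
df['income_boxcox'] = pt.fit_transform(df[['income']]).ravel()

print(df)
```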
Handling Categorical Features:
Categorical features require special treatment because most algorithms expect numeric inputs.
One-hot encoding creates a binary column for each category. It preserves all information but can create many features for high-cardinality variables (e.g., zip codes, product IDs).
Target encoding replaces each category with the mean of the target for that category. It's powerful for high-cardinality features but risks leakage (information from the target bleeds into features). Use leave-one-out or cross-validation schemes to mitigate.
Embedding (for deep learning) learns dense vector representations of categories. Useful when categories have rich structure (words, products with many attributes).
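To make the leakage mitigation for target encoding concrete, here is a sketch of out-of-fold target encoding with smoothing; the column names ('zip_code', 'churned') and the smoothing constant are hypothetical choices, not a standard API:

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5, smoothing=10):
    """Out-of-fold target encoding: each row's encoding is computed
    from folds that do not contain that row, reducing target leakage."""
    global_mean = df[target_col].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(df):
        fit_fold = df.iloc[fit_idx]
        stats = fit_fold.groupby(cat_col)[target_col].agg(['mean', 'count'])
        # Shrink rare categories toward the global mean
        smooth = (stats['mean'] * stats['count'] + global_mean * smoothing) / (
            stats['count'] + smoothing
        )
        encoded.iloc[enc_idx] = (
            df.iloc[enc_idx][cat_col].map(smooth).fillna(global_mean).values
        )
    return encoded

# Hypothetical usage: high-cardinality 'zip_code' feature, binary 'churned' target
# df['zip_code_te'] = target_encode_oof(df, 'zip_code', 'churned')
```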
Transformations (scaling, encoding) must be fit only on training data, then applied to validation and test data. Fitting on the full dataset causes data leakage—information from the test set influences the transformation, leading to overoptimistic evaluation.
```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Define feature types
numeric_features = ['age', 'income', 'tenure']
categorical_features = ['country', 'subscription_type']

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# CORRECT: Fit on train, transform on train and test
X_train_processed = preprocessor.fit_transform(X_train)  # fit + transform
X_test_processed = preprocessor.transform(X_test)        # transform only

# WRONG: Fitting on all data causes leakage
# X_all_processed = preprocessor.fit_transform(X_all)  # DON'T DO THIS

# For log transformation (handle zeros)
def safe_log(x):
    return np.log1p(x)  # log(1 + x) handles x=0

df['income_log'] = safe_log(df['income'])
```

Class imbalance occurs when one class significantly outnumbers others. It's common in practical problems: fraud detection, rare-disease diagnosis, manufacturing defect detection, and churn prediction all involve one class that is far rarer than the others.
Imbalance causes problems because most algorithms optimize overall accuracy. A model predicting 'not fraud' for every transaction achieves 99% accuracy on data with 1% fraud—but it's useless.
Strategies for Handling Imbalance:
| Strategy | Description | Advantages | Disadvantages |
|---|---|---|---|
| Random undersampling | Randomly remove majority class samples | Fast, reduces training time | May lose important information |
| Random oversampling | Randomly duplicate minority samples | Simple, no information loss | May cause overfitting to duplicates |
| SMOTE | Generate synthetic minority samples via interpolation | Creates new, varied samples | May generate unrealistic samples |
| ADASYN | Focus synthetic samples near decision boundary | Adaptive to hard regions | More complex; may overfit noise |
| Tomek Links | Remove majority samples near boundary | Cleans decision boundary | Removes potentially useful data |
| Cluster-based undersampling | Keep representative majority samples per cluster | Preserves distribution structure | More complex; clustering choices matter |
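If you do resample, the third-party imbalanced-learn package implements most of the strategies above; a minimal sketch, assuming X_train and y_train from an earlier split (resample only the training data, never validation or test):

```python
# pip install imbalanced-learn
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

print("Before:", Counter(y_train))

# Synthetic minority oversampling (SMOTE)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_res))

# Or simple random undersampling / oversampling
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
```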
Class weighting is often the simplest and most effective approach. It requires no data manipulation, integrates naturally into the loss function, and is supported by almost all ML libraries. Try it before more complex resampling techniques.
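For example, most scikit-learn estimators accept a class_weight argument; a minimal sketch, again assuming X_train and y_train from an earlier split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' reweights each class inversely to its frequency in y_train
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

# Inspect the implied weights: n_samples / (n_classes * class_count)
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
print(dict(zip(classes, weights)))

# Custom weights are also possible, e.g. penalize missed positives 20x more
clf_custom = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000)
```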
When Not to 'Fix' Imbalance:
Not all imbalance needs correction. If you evaluate with metrics that respect the minority class (precision, recall, AUC rather than raw accuracy) and the model already catches rare cases acceptably, resampling adds complexity without clear benefit. And if you need predicted probabilities that reflect real-world base rates, resampling distorts them; class weighting, or leaving the data as-is, is often the better choice.
Correctly splitting data is fundamental to reliable ML evaluation. The goal is to honestly estimate how your model will perform on unseen data from the real world.
The Three Subsets:
The training set is used to fit model parameters. The validation set is used to tune hyperparameters and choose between candidate models. The test set is held out for a single, final estimate of performance on unseen data.
Why This Matters:
If you tune hyperparameters on the test set, you're implicitly training on the test set: information leaks from 'unseen' data into your modeling decisions. Your test performance becomes optimistic. When you deploy, actual performance is worse because real-world data truly is unseen.
Never, ever use the test set for any purpose other than final evaluation. Don't look at test predictions to debug. Don't use test performance to choose between models. Don't check the test set 'just to see how we're doing.' Every peek contaminates your evaluation.
Splitting Strategies:
Random splitting works when data points are independent and identically distributed. Shuffle, then split by percentage.
Stratified splitting ensures each split has the same class distribution as the original data. Essential for imbalanced datasets where random splits might exclude rare classes from smaller subsets.
Temporal splitting is required for time-series and sequential data. Training data must precede validation, which precedes test—mimicking how the model will be used (predict the future from the past). Random splitting would allow information from the future to leak into training.
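For cross-validation under this constraint, scikit-learn's TimeSeriesSplit always validates on rows that come after the training window; a small sketch with a synthetic, already time-sorted array:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # assume rows are already sorted by time

# Each fold trains on an expanding window of the past and validates on the block after it
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print(f"train rows 0-{train_idx[-1]}, validate rows {val_idx[0]}-{val_idx[-1]}")
```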
Group splitting is needed when data points aren't independent. If you have multiple transactions per customer, all transactions from a customer should be in the same split. Otherwise, the model could memorize customer-specific patterns and appear to generalize when it's just remembering.
| Scenario | Splitting Strategy | Rationale |
|---|---|---|
| Standard classification/regression, iid data | Random or stratified | No temporal or group structure |
| Imbalanced classes | Stratified | Ensure minority class in all splits |
| Time series forecasting | Temporal (older→newer) | Predict future from past only |
| Per-user predictions | Group by user | Prevent memorizing user patterns |
| Medical images, multiple per patient | Group by patient | Test on unseen patients |
| Fraud detection with known fraudsters | Group by entity + temporal | Unseen entities, respect time |
```python
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GroupKFold, TimeSeriesSplit
)

# ===== BASIC RANDOM SPLIT =====
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
# Result: 70% train, 15% val, 15% test

# ===== STRATIFIED SPLIT (maintains class ratios) =====
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# ===== TEMPORAL SPLIT (for time series) =====
# Sort by time first
df = df.sort_values('timestamp')
n = len(df)
train_end = int(n * 0.7)
val_end = int(n * 0.85)

train_df = df.iloc[:train_end]
val_df = df.iloc[train_end:val_end]
test_df = df.iloc[val_end:]

# ===== GROUP SPLIT (e.g., by user) =====
# All data from each user stays in one split
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=user_ids))

# ===== K-FOLD CROSS-VALIDATION =====
# When data is limited, use k-fold instead of a single split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Train and evaluate
```

When data is limited, a single train/val/test split may have high variance: you might get lucky or unlucky with the split. K-fold cross-validation provides more robust estimates by rotating which data is held out. 5-fold or 10-fold is common.
Production ML requires reproducibility. When a model behaves unexpectedly, you need to trace what data it was trained on. When you want to retrain, you need to recreate the exact training set. This requires data versioning and lineage tracking.
Data Versioning:
Like code versioning (git), data versioning tracks changes to datasets over time. Key requirements: you should be able to identify exactly which dataset version a given model was trained on, recreate that dataset on demand, and see what changed between versions.
Tools include DVC (Data Version Control), MLflow, Delta Lake, and cloud-specific solutions.
Practical Data Management:
Store raw data separately from processed data: Raw data is immutable and retained. Processing can be rerun.
Version processing code alongside data: A dataset version is specified by the raw data version plus the processing code version.
Include metadata: Store not just data but also schema, statistics, quality metrics, and lineage information.
Automate validation: Run data quality checks automatically before data is used for training. Catch issues early.
Enable rollback: If new data causes problems, you should be able to revert to previous versions.
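As a lightweight starting point before adopting a dedicated tool, you can fingerprint each dataset file and store minimal metadata next to it; this sketch uses only the Python standard library, and the file paths are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: str) -> str:
    """Content hash that uniquely identifies this exact version of the file."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def write_dataset_card(data_path: str, processing_code_version: str) -> None:
    """Store hash, size, and lineage info alongside the dataset."""
    meta = {
        'data_file': data_path,
        'sha256': file_sha256(data_path),
        'size_bytes': Path(data_path).stat().st_size,
        'processing_code_version': processing_code_version,  # e.g. a git commit hash
        'created_at': datetime.now(timezone.utc).isoformat(),
    }
    with open(data_path + '.meta.json', 'w') as f:
        json.dump(meta, f, indent=2)

write_dataset_card('processed/train.parquet', processing_code_version='abc1234')  # hypothetical
```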
Minimal Viable Data Documentation:
At minimum, every training dataset should have: a record of where the data came from and when it was collected, the version of the processing code that produced it, its schema and summary statistics, known quality issues, and the definition of the train/validation/test split.
Data versioning seems like overhead when you're moving fast. But after you've spent a week debugging a model and realized the issue was data that changed underneath you, you'll wish you'd invested earlier. Set up versioning before you need it.
Data collection and preparation is where ML projects succeed or fail. It's unglamorous, time-consuming, and requires deep understanding of both the domain and the ML algorithms that will consume the data. The core lessons: log generously at collection time, measure and fix data quality before adding model complexity, document every cleaning decision, fit transformations on training data only, choose a splitting strategy that matches the structure of your data, and version your datasets so results can be reproduced.
What's Next:
With clean, properly split data in hand, you're ready for Feature Engineering—the art of creating informative representations of your data that enable models to learn effectively. Feature engineering bridges raw data and powerful models, often determining the gap between baseline and state-of-the-art performance.
You now understand the full scope of data work in ML: collection strategies, quality dimensions, cleaning techniques, transformations, imbalance handling, proper splitting, and versioning. This knowledge forms the operational backbone of every successful ML project.