If you've spent any time in industry machine learning, you've heard the refrain: 80% of ML work is data work. This isn't hyperbole—it's conservative. Data collection and preparation is where projects succeed or fail, where timelines double or triple, and where the difference between academic exercises and production systems becomes painfully clear.
Consider the asymmetry: you can execute state-of-the-art algorithms with a few library calls, but cleaning a dataset requires understanding every quirk of the data generation process, every bug in the logging pipeline, every edge case in user behavior. Algorithms are well-documented; your company's data pipelines are not.
This page teaches you to approach data work systematically—not as a tedious prerequisite to the 'real' ML work, but as the foundation that determines your ceiling. A perfect model trained on flawed data is a flawed model. There are no shortcuts.
By the end of this page, you will understand strategies for data collection, common data quality issues and how to detect them, data cleaning techniques, handling missing values, addressing data imbalance, and constructing reliable train/validation/test splits. You will be equipped to transform raw, messy data into model-ready datasets.
Before you can clean data, you must collect it. The source and method of collection profoundly affect what's possible downstream. Different collection strategies have different costs, biases, and quality characteristics.
Primary Data Collection Sources:
| Source Type | Examples | Advantages | Challenges |
|---|---|---|---|
| Transactional Systems | Purchase logs, user actions, API calls | High volume, already structured, reflects real behavior | May miss context, logging gaps, privacy constraints |
| Sensors & IoT | Telemetry, GPS, industrial sensors | Continuous, objective measurements | Sensor drift, missing readings, calibration issues |
| User-Generated Content | Reviews, posts, uploads | Rich, diverse, captures opinions | Noisy, biased toward vocal users, spam |
| Surveys & Annotations | Crowdsourced labels, expert ratings | Direct answers to your questions | Expensive, slow, annotator biases |
| External Data | APIs, purchased datasets, open data | Augments internal data, new signals | Quality varies, integration complexity, licensing |
| Synthetic Data | Generated via simulation or models | Infinite volume, controllable | May not match real distribution, verification needed |
Active vs. Passive Collection:
Passive collection captures data as a byproduct of normal operations. User clicks are logged; transactions are recorded. It's cheap and scalable but limited to what naturally occurs.
Active collection involves deliberate effort to gather specific data: running surveys, creating labeling tasks, or designing experiments to capture edge cases. It's expensive but targeted.
Most ML projects require both: passive collection provides volume; active collection fills gaps and validates labels.
Designing Data Collection:
When you have influence over data collection (which happens more often than people realize), the single most important consideration is to err on the side of logging more:
If there's any chance a signal might be useful for ML, log it now. Adding logging takes days; waiting for enough historical data takes months. The cost of logging something you don't use is trivial compared to the cost of not having data you need.
Data quality isn't a single dimension—it's a multifaceted concept encompassing accuracy, completeness, consistency, and relevance. Poor data quality is insidious: it silently degrades model performance, creates false confidence in results, and generates predictions that look reasonable but are systematically wrong.
Dimensions of Data Quality:
As noted above, quality spans at least accuracy, completeness, consistency, and relevance; a dataset can look excellent on one dimension while failing badly on another.
Data Quality Issues and Their Sources:
Data quality problems don't appear randomly—they have systematic causes:
| Issue | Common Causes | Detection Strategies |
|---|---|---|
| Missing Values | Optional fields, data entry errors, logging failures, user drop-off | Count nulls per column; analyze missingness patterns |
| Incorrect Values | Data entry errors, sensor malfunction, currency/unit confusion | Range checks, cross-validation with other sources |
| Duplicate Records | System bugs, retry mechanisms, merge failures | Deduplication on unique identifiers; fuzzy matching |
| Inconsistent Formats | Multiple data sources, schema evolution | Schema validation, format parsing errors |
| Outliers | Real anomalies, measurement errors, data entry errors | Statistical tests, visualization, domain knowledge |
| Label Noise | Annotator errors, ambiguous cases, evolving definitions | Inter-annotator agreement; spot-checking |
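These detection strategies are easy to script. Below is a minimal sketch of a quick quality audit with pandas; the file name, column names, and the allowed 'age' range are hypothetical, so adapt the checks to your own schema.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> None:
    """Print a quick overview of common data quality issues."""
    # Missing values: count per column, largest first
    missing = df.isnull().sum()
    print("Missing values per column:")
    print(missing[missing > 0].sort_values(ascending=False))

    # Exact duplicate rows
    print(f"\nExact duplicate rows: {df.duplicated().sum()}")

    # Simple range check on a numeric column (hypothetical bounds)
    if 'age' in df.columns:
        bad_age = df[(df['age'] < 0) | (df['age'] > 120)]
        print(f"Rows with out-of-range age: {len(bad_age)}")

    # Basic distribution summary to eyeball outliers and unit problems
    print("\nNumeric summary:")
    print(df.describe().T[['min', 'mean', 'max']])

quality_report(pd.read_csv('data.csv'))  # hypothetical file
```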
No algorithm can overcome fundamentally flawed data. A model trained on data with 10% label noise will have at least 10% error—probably more. Invest in data quality before investing in model complexity.
Once you've identified data quality issues, you must address them. Data cleaning is both an art and a science—there are general principles, but specific decisions depend on domain knowledge and downstream use cases.
Handling Missing Values:
Missing data is ubiquitous. The appropriate treatment depends on why data is missing: values may be missing completely at random (MCAR), missing at random given other observed features (MAR), or missing not at random (MNAR), where the missingness depends on the unobserved value itself.
| Strategy | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Drop rows | Small fraction missing, MCAR | Simple, no bias from imputation | Loses data; biased if not MCAR |
| Drop columns | Column mostly missing or uninformative | Reduces dimensionality | Loses potentially useful signal |
| Mean/Median imputation | Numeric, MCAR, need simplicity | Fast, preserves sample size | Distorts variance, relationships |
| Mode imputation | Categorical, MCAR | Fast, preserves sample size | Distorts category frequencies |
| Model-based imputation | MAR, enough data to build imputer | Uses relationships between features | Complex, can propagate errors |
| Indicator variable | Missingness is informative | Preserves missingness signal | Adds feature; may overfit |
| Domain-specific defaults | Known default makes sense | Incorporates domain knowledge | Requires domain expertise |
Handling Outliers:
Outliers are observations that deviate significantly from the rest of the data. But deviation isn't inherently wrong: outliers can be genuine rare events (often the most interesting cases, such as fraudulent transactions) or artifacts of measurement and data entry errors.
Distinguishing these requires domain knowledge. Common strategies include removing values that are clearly erroneous, capping extremes at chosen percentiles (winsorization), applying transformations such as log that reduce their influence, and keeping them untouched when they are the phenomenon you care about.
Handling Duplicates:
Duplicates can be exact (identical rows, typically produced by retries, system bugs, or failed merges) or partial (the same underlying event recorded multiple times with small differences, which requires deduplicating on key columns or fuzzy matching).
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Load data
df = pd.read_csv('data.csv')

# ===== MISSING VALUE HANDLING =====

# Strategy 1: Drop rows with any missing values (use cautiously)
df_no_missing = df.dropna()

# Strategy 2: Drop columns with >50% missing
threshold = 0.5
cols_to_keep = df.columns[df.isnull().mean() < threshold]
df_filtered = df[cols_to_keep]

# Strategy 3: Mean imputation for numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy='mean')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Strategy 4: Mode imputation for categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Strategy 5: KNN imputation (uses similarity between samples)
# Shown as an alternative to Strategy 3, not a follow-up step
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])

# Strategy 6: Add missingness indicator
# In a real pipeline, compute indicators BEFORE imputing, or the flag is always 0
df['age_missing'] = df['age'].isnull().astype(int)

# ===== OUTLIER HANDLING =====

# Winsorization: clip to 1st and 99th percentile
for col in numeric_cols:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = df[col].clip(lower, upper)

# ===== DUPLICATE HANDLING =====

# Remove exact duplicates
df_deduped = df.drop_duplicates()

# Remove duplicates based on subset of columns (e.g., keep latest)
df_deduped = df.sort_values('timestamp').drop_duplicates(
    subset=['user_id', 'product_id'], keep='last'
)
```

Every data cleaning decision is a choice that affects downstream analysis. Document what you did and why. Future you (or your teammates) will need this context when results are questioned or when the process needs to be reproduced.
Beyond cleaning, data often needs transformation to be suitable for ML algorithms. Different algorithms have different assumptions about input data; violating these assumptions degrades performance.
Common Transformations:
| Transformation | Purpose | When to Use | Algorithms That Benefit |
|---|---|---|---|
| Scaling (Min-Max) | Scale features to [0, 1] range | When features have different scales | Neural networks, SVM, KNN |
| Standardization (Z-score) | Center to mean=0, std=1 | When features are normally distributed | Linear regression, PCA, neural networks |
| Log transformation | Reduce skewness, compress range | Right-skewed distributions (income, counts) | Linear models, improves normality |
| Box-Cox transformation | Generalized power transform for normality | When log isn't sufficient | Linear models requiring normality |
| One-hot encoding | Convert categories to binary columns | Nominal (unordered) categories | Most algorithms (except tree-based) |
| Ordinal encoding | Convert ordered categories to integers | Ordinal categories (low/medium/high) | Tree-based algorithms, models aware of order |
| Target encoding | Replace category with target mean | High-cardinality categories | Any algorithm; requires careful regularization |
| Binning/Discretization | Convert continuous to categorical | Non-linear relationships, simplification | Decision trees (benefit less), interpretability |
Numeric Feature Transformations in Depth:
Scaling is crucial for distance-based algorithms. If one feature ranges from 0-1000 and another from 0-1, the first dominates distance calculations. K-Nearest Neighbors, SVM, and neural networks all require scaled inputs.
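As a small illustration (with made-up feature values and ranges), the unscaled distance between two customers is dominated entirely by the income feature, while the scaled distance reflects both features:

```python
import numpy as np

# Hypothetical customers: (income in dollars, satisfaction rating in [0, 1])
a = np.array([52000.0, 0.1])
b = np.array([51000.0, 0.9])

# Unscaled Euclidean distance is dominated by the income difference
print(np.linalg.norm(a - b))      # ~1000.0

# Min-max scaling with assumed feature ranges (income 0-100000, rating 0-1)
mins = np.array([0.0, 0.0])
maxs = np.array([100000.0, 1.0])
a_s = (a - mins) / (maxs - mins)
b_s = (b - mins) / (maxs - mins)
print(np.linalg.norm(a_s - b_s))  # ~0.80: the rating difference now matters
```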
Log and power transformations address distributional issues. Many real-world quantities (income, transaction amounts, time durations) are heavily right-skewed. Log transformation makes relationships more linear and reduces the influence of extreme values:
original: [1, 10, 100, 1000, 10000]
log10: [0, 1, 2, 3, 4]
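A brief sketch of both the log and power transforms, assuming a right-skewed column named 'income' and using scikit-learn's PowerTransformer for Box-Cox:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed data
df = pd.DataFrame({'income': [20_000, 35_000, 48_000, 90_000, 1_200_000]})

# Log transform: log1p handles zeros safely
df['income_log'] = np.log1p(df['income'])

# Box-Cox requires strictly positive values; method='yeo-johnson' also allows zero/negative
pt = PowerTransformer(method='box-cox')
df['income_boxcox'] = pt.fit_transform(df[['income']]).ravel()

print(df)
```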
Handling Categorical Features:
Categorical features require special treatment because most algorithms expect numeric inputs.
One-hot encoding creates a binary column for each category. It preserves all information but can create many features for high-cardinality variables (e.g., zip codes, product IDs).
Target encoding replaces each category with the mean of the target for that category. It's powerful for high-cardinality features but risks leakage (information from the target bleeds into features). Use leave-one-out or cross-validation schemes to mitigate.
Embedding (for deep learning) learns dense vector representations of categories. Useful when categories have rich structure (words, products with many attributes).
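To make the leakage mitigation for target encoding concrete, here is a sketch of out-of-fold target encoding with smoothing; the column names ('zip_code', 'churned') and the smoothing constant are hypothetical choices, not a standard API:

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5, smoothing=10):
    """Out-of-fold target encoding: each row's encoding is computed
    from folds that do not contain that row, reducing target leakage."""
    global_mean = df[target_col].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(df):
        fit_fold = df.iloc[fit_idx]
        stats = fit_fold.groupby(cat_col)[target_col].agg(['mean', 'count'])
        # Shrink rare categories toward the global mean
        smooth = (stats['mean'] * stats['count'] + global_mean * smoothing) / (
            stats['count'] + smoothing
        )
        encoded.iloc[enc_idx] = (
            df.iloc[enc_idx][cat_col].map(smooth).fillna(global_mean).values
        )
    return encoded

# Hypothetical usage: high-cardinality 'zip_code' feature, binary 'churned' target
# df['zip_code_te'] = target_encode_oof(df, 'zip_code', 'churned')
```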
Transformations (scaling, encoding) must be fit only on training data, then applied to validation and test data. Fitting on the full dataset causes data leakage—information from the test set influences the transformation, leading to overoptimistic evaluation.
```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Define feature types
numeric_features = ['age', 'income', 'tenure']
categorical_features = ['country', 'subscription_type']

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# CORRECT: Fit on train, transform on train and test
X_train_processed = preprocessor.fit_transform(X_train)  # fit + transform
X_test_processed = preprocessor.transform(X_test)        # transform only

# WRONG: Fitting on all data causes leakage
# X_all_processed = preprocessor.fit_transform(X_all)  # DON'T DO THIS

# For log transformation (handle zeros)
def safe_log(x):
    return np.log1p(x)  # log(1 + x) handles x=0

df['income_log'] = safe_log(df['income'])
```

Class imbalance occurs when one class significantly outnumbers others. It's common in practical problems: fraud detection, rare-disease diagnosis, manufacturing defect detection, and churn prediction all involve one class that is far rarer than the others.
Imbalance causes problems because most algorithms optimize overall accuracy. A model predicting 'not fraud' for every transaction achieves 99% accuracy on data with 1% fraud—but it's useless.
Strategies for Handling Imbalance:
| Strategy | Description | Advantages | Disadvantages |
|---|---|---|---|
| Random undersampling | Randomly remove majority class samples | Fast, reduces training time | May lose important information |
| Random oversampling | Randomly duplicate minority samples | Simple, no information loss | May cause overfitting to duplicates |
| SMOTE | Generate synthetic minority samples via interpolation | Creates new, varied samples | May generate unrealistic samples |
| ADASYN | Focus synthetic samples near decision boundary | Adaptive to hard regions | More complex; may overfit noise |
| Tomek Links | Remove majority samples near boundary | Cleans decision boundary | Removes potentially useful data |
| Cluster-based undersampling | Keep representative majority samples per cluster | Preserves distribution structure | More complex; clustering choices matter |
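If you do resample, the third-party imbalanced-learn package implements most of the strategies above; a minimal sketch, assuming X_train and y_train from an earlier split (resample only the training data, never validation or test):

```python
# pip install imbalanced-learn
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

print("Before:", Counter(y_train))

# Synthetic minority oversampling (SMOTE)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_res))

# Or simple random undersampling / oversampling
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
```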
Class weighting is often the simplest and most effective approach. It requires no data manipulation, integrates naturally into the loss function, and is supported by almost all ML libraries. Try it before more complex resampling techniques.
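For example, most scikit-learn estimators accept a class_weight argument; a minimal sketch, again assuming X_train and y_train from an earlier split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' reweights each class inversely to its frequency in y_train
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

# Inspect the implied weights: n_samples / (n_classes * class_count)
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
print(dict(zip(classes, weights)))

# Custom weights are also possible, e.g. penalize missed positives 20x more
clf_custom = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000)
```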
When Not to 'Fix' Imbalance:
Not all imbalance needs correction. If you evaluate with metrics that respect the minority class (precision, recall, AUC rather than raw accuracy) and the model already catches rare cases acceptably, resampling adds complexity without clear benefit. And if you need predicted probabilities that reflect real-world base rates, resampling distorts them; class weighting, or leaving the data as-is, is often the better choice.
Correctly splitting data is fundamental to reliable ML evaluation. The goal is to honestly estimate how your model will perform on unseen data from the real world.
The Three Subsets:
The training set is used to fit model parameters. The validation set is used to tune hyperparameters and choose between candidate models. The test set is held out for a single, final estimate of performance on unseen data.
Why This Matters:
If you tune hyperparameters on the test set, you're implicitly training on the test set: information leaks from 'unseen' data into your modeling decisions. Your test performance becomes optimistic. When you deploy, actual performance is worse because real-world data truly is unseen.
Never, ever use the test set for any purpose other than final evaluation. Don't look at test predictions to debug. Don't use test performance to choose between models. Don't check the test set 'just to see how we're doing.' Every peek contaminates your evaluation.
Splitting Strategies:
Random splitting works when data points are independent and identically distributed. Shuffle, then split by percentage.
Stratified splitting ensures each split has the same class distribution as the original data. Essential for imbalanced datasets where random splits might exclude rare classes from smaller subsets.
Temporal splitting is required for time-series and sequential data. Training data must precede validation, which precedes test—mimicking how the model will be used (predict the future from the past). Random splitting would allow information from the future to leak into training.
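For cross-validation under this constraint, scikit-learn's TimeSeriesSplit always validates on rows that come after the training window; a small sketch with a synthetic, already time-sorted array:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # assume rows are already sorted by time

# Each fold trains on an expanding window of the past and validates on the block after it
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print(f"train rows 0-{train_idx[-1]}, validate rows {val_idx[0]}-{val_idx[-1]}")
```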
Group splitting is needed when data points aren't independent. If you have multiple transactions per customer, all transactions from a customer should be in the same split. Otherwise, the model could memorize customer-specific patterns and appear to generalize when it's just remembering.
| Scenario | Splitting Strategy | Rationale |
|---|---|---|
| Standard classification/regression, iid data | Random or stratified | No temporal or group structure |
| Imbalanced classes | Stratified | Ensure minority class in all splits |
| Time series forecasting | Temporal (older→newer) | Predict future from past only |
| Per-user predictions | Group by user | Prevent memorizing user patterns |
| Medical images, multiple per patient | Group by patient | Test on unseen patients |
| Fraud detection with known fraudsters | Group by entity + temporal | Unseen entities, respect time |
```python
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GroupKFold, TimeSeriesSplit
)

# ===== BASIC RANDOM SPLIT =====
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
# Result: 70% train, 15% val, 15% test

# ===== STRATIFIED SPLIT (maintains class ratios) =====
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# ===== TEMPORAL SPLIT (for time series) =====
# Sort by time first
df = df.sort_values('timestamp')
n = len(df)
train_end = int(n * 0.7)
val_end = int(n * 0.85)

train_df = df.iloc[:train_end]
val_df = df.iloc[train_end:val_end]
test_df = df.iloc[val_end:]

# ===== GROUP SPLIT (e.g., by user) =====
# All data from each user stays in one split
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=user_ids))

# ===== K-FOLD CROSS-VALIDATION =====
# When data is limited, use k-fold instead of a single split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Train and evaluate
```

When data is limited, a single train/val/test split may have high variance: you might get lucky or unlucky with the split. K-fold cross-validation provides more robust estimates by rotating which data is held out. 5-fold or 10-fold is common.
Production ML requires reproducibility. When a model behaves unexpectedly, you need to trace what data it was trained on. When you want to retrain, you need to recreate the exact training set. This requires data versioning and lineage tracking.
Data Versioning:
Like code versioning (git), data versioning tracks changes to datasets over time. Key requirements: you should be able to identify exactly which dataset version a given model was trained on, recreate that dataset on demand, and see what changed between versions.
Tools include DVC (Data Version Control), MLflow, Delta Lake, and cloud-specific solutions.
Practical Data Management:
Store raw data separately from processed data: Raw data is immutable and retained. Processing can be rerun.
Version processing code alongside data: A dataset version is specified by the raw data version plus the processing code version.
Include metadata: Store not just data but also schema, statistics, quality metrics, and lineage information.
Automate validation: Run data quality checks automatically before data is used for training. Catch issues early.
Enable rollback: If new data causes problems, you should be able to revert to previous versions.
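As a lightweight starting point before adopting a dedicated tool, you can fingerprint each dataset file and store minimal metadata next to it; this sketch uses only the Python standard library, and the file paths are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: str) -> str:
    """Content hash that uniquely identifies this exact version of the file."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def write_dataset_card(data_path: str, processing_code_version: str) -> None:
    """Store hash, size, and lineage info alongside the dataset."""
    meta = {
        'data_file': data_path,
        'sha256': file_sha256(data_path),
        'size_bytes': Path(data_path).stat().st_size,
        'processing_code_version': processing_code_version,  # e.g. a git commit hash
        'created_at': datetime.now(timezone.utc).isoformat(),
    }
    with open(data_path + '.meta.json', 'w') as f:
        json.dump(meta, f, indent=2)

write_dataset_card('processed/train.parquet', processing_code_version='abc1234')  # hypothetical
```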
Minimal Viable Data Documentation:
At minimum, every training dataset should have: a record of where the data came from and when it was collected, the version of the processing code that produced it, its schema and summary statistics, known quality issues, and the definition of the train/validation/test split.
Data versioning seems like overhead when you're moving fast. But after you've spent a week debugging a model and realized the issue was data that changed underneath you, you'll wish you'd invested earlier. Set up versioning before you need it.
Data collection and preparation is where ML projects succeed or fail. It's unglamorous, time-consuming, and requires deep understanding of both the domain and the ML algorithms that will consume the data. The core lessons: log generously at collection time, measure and fix data quality before adding model complexity, document every cleaning decision, fit transformations on training data only, choose a splitting strategy that matches the structure of your data, and version your datasets so results can be reproduced.
What's Next:
With clean, properly split data in hand, you're ready for Feature Engineering—the art of creating informative representations of your data that enable models to learn effectively. Feature engineering bridges raw data and powerful models, often determining the gap between baseline and state-of-the-art performance.
You now understand the full scope of data work in ML: collection strategies, quality dimensions, cleaning techniques, transformations, imbalance handling, proper splitting, and versioning. This knowledge forms the operational backbone of every successful ML project.