If machine learning is the engine, data is the fuel. Every ML system, from simple linear regression to trillion-parameter language models, learns by extracting patterns from data. The quality, quantity, and structure of your data fundamentally determine what your model can learn and how well it will perform.
Unlike traditional programming where developers encode knowledge as rules, ML practitioners encode knowledge as data. The art of machine learning is largely the art of data—collecting it, cleaning it, representing it effectively, and understanding its limitations.
Andrew Ng's 'Data-Centric AI' movement emphasizes that improving data quality often yields better results than improving algorithms. In many practical applications, switching from a state-of-the-art model to a simpler one with better data produces superior outcomes.
To work with data effectively, we need a precise vocabulary. ML data is typically organized as a matrix where rows represent instances (examples, observations, samples) and columns represent features (attributes, variables, dimensions).
| Term | Definition | Example (House Price Dataset) |
|---|---|---|
| Instance (Sample) | A single observation or example in the dataset | One house with its characteristics |
| Feature (Attribute) | A measurable property of an instance | Square footage, number of bedrooms, year built |
| Label (Target) | The value we want to predict (in supervised learning) | Sale price of the house |
| Feature Vector | Collection of all features for one instance | [2000 sqft, 3 bedrooms, 1998, ...] |
| Dataset | Collection of instances with their features (and labels) | 10,000 houses with their characteristics and prices |
| Training Set | Portion of data used to train the model | 8,000 houses used for learning patterns |
| Test Set | Held-out data used to evaluate model performance | 2,000 houses never seen during training |
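The vocabulary above can be sketched in a few lines of code. The snippet below builds a synthetic house-price dataset (all values are made up for illustration; numpy is assumed available) and performs the 8,000/2,000 split from the table:

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances = 10_000

# Feature matrix X: one row per instance, one column per feature.
X = np.column_stack([
    rng.uniform(500, 5000, n_instances),     # square footage
    rng.integers(1, 6, n_instances),         # number of bedrooms
    rng.integers(1950, 2024, n_instances),   # year built
])
# Label vector y: the target we want to predict (sale price).
y = 150 * X[:, 0] + rng.normal(0, 10_000, n_instances)

# 80/20 train/test split: shuffle indices once, then slice.
idx = rng.permutation(n_instances)
train_idx, test_idx = idx[:8_000], idx[8_000:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8000, 3) (2000, 3)
```

Shuffling before slicing matters: if the rows were sorted (say, by price), a naive slice would give train and test sets with different distributions.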
Raw data rarely comes in a form suitable for ML algorithms. Feature engineering—the process of transforming raw data into informative features—is often the difference between a mediocre model and a great one. Before deep learning's rise, feature engineering was considered the most critical skill in ML.
Numerical (continuous) features represent measurable quantities on a continuous scale, such as square footage or price.

Common transformations:

- **Normalization (Min-Max Scaling):** Scale values to the [0, 1] range.
- **Standardization (Z-Score):** Transform to mean = 0, standard deviation = 1.
- **Log Transformation:** Compress large ranges and reduce skewness.
- **Binning/Discretization:** Convert continuous values into categorical buckets.
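Each of these transformations is a one-liner with numpy (assumed available); the feature values below are illustrative:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # a skewed numerical feature

# Normalization (min-max scaling): map values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()

# Log transformation: compress the large range; log1p handles zeros safely
x_log = np.log1p(x)

# Binning: discretize into 3 equal-width bins labeled 0, 1, 2
edges = np.linspace(x.min(), x.max(), 4)[1:-1]  # two interior bin edges
x_binned = np.digitize(x, edges)

print(x_binned)  # [0 0 0 0 2]
```

Note how equal-width binning handles skew poorly here: four of the five values land in the first bin. For skewed features, quantile-based bins are often the better choice.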
The best features capture domain knowledge. A data scientist who understands that 'time since last transaction' matters for fraud detection will build a better model than one who blindly throws raw timestamps at the algorithm. Deep learning has automated some feature engineering, but domain expertise remains invaluable.
Most ML theory rests on a fundamental assumption: that data points are Independent and Identically Distributed (i.i.d.). This assumption is so important that violating it can completely undermine your model's reliability.
**Independent:** Each data point is sampled independently; knowing one observation tells you nothing about the others.

Violation example: time series data, where today's stock price depends on yesterday's, or sequential user clicks, where one click influences the next.

**Identically Distributed:** All data points come from the same underlying probability distribution.

Violation example: training on summer data and testing on winter data, or training on US users and deploying globally.
Common pitfalls when the i.i.d. assumption fails:

- **Distribution Shift:** Training and deployment data differ.
- **Data Leakage:** Future information leaks into training.
- **Temporal Dependence:** Sequential data violates independence.
- **Selection Bias:** The sample isn't representative of the population.
Specialized techniques exist for non-i.i.d. scenarios: time series models, transfer learning, domain adaptation.
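A crude way to catch "identically distributed" violations before they bite is to compare a feature's distribution between training data and incoming deployment data. The sketch below uses a hypothetical mean-gap score (not a standard statistical test; numpy is assumed) on the summer-vs-winter temperature scenario:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated temperature feature: "summer" training data vs "winter" deployment data
train_temps = rng.normal(loc=25.0, scale=3.0, size=5_000)
deploy_temps = rng.normal(loc=2.0, scale=4.0, size=5_000)

def shift_score(a, b):
    """Gap between the two means, measured in pooled-standard-deviation units."""
    pooled_std = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_std

# Same-distribution samples score near 0; shifted samples score well above 1.
print(shift_score(train_temps, deploy_temps))
```

In practice, production systems use proper two-sample tests (e.g. Kolmogorov-Smirnov) or population-stability metrics, but the idea is the same: monitor whether deployment data still looks like training data.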
A hospital develops an ML model to predict patient readmission risk using 2019 data. The model achieves 92% accuracy and is deployed in January 2020. By March 2020, performance has dropped to 67% accuracy: the COVID-19 pandemic changed patient populations, treatment protocols, and readmission patterns, violating the 'identically distributed' assumption. This is called 'concept drift' or 'distribution shift.' The world changed, but the model didn't. Solutions include monitoring model performance in production, retraining regularly, and building models robust to distribution shift.
Not all data is created equal. Data quality directly impacts model quality. Understanding the dimensions of data quality helps diagnose problems and prioritize data improvement efforts.
| Dimension | Definition | ML Impact | Detection Methods |
|---|---|---|---|
| Accuracy | Do values correctly represent reality? | Incorrect labels directly teach wrong patterns | Manual auditing, cross-validation with other sources |
| Completeness | Are there missing values? | Missingness can be informative or biasing | Count nulls/NaNs per feature; analyze missing patterns |
| Consistency | Are values consistent across the dataset? | Inconsistent encoding breaks pattern learning | Unique value counts; format validation |
| Timeliness | Is the data current and relevant? | Stale data may not reflect current patterns | Check timestamps; compare with recent ground truth |
| Representativeness | Does the data represent the target population? | Biased training → biased predictions | Compare distributions across subgroups |
| Relevance | Are features informative for the task? | Irrelevant features add noise; hurt performance | Feature importance analysis; domain expertise |
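Several of the detection methods in the table map directly to a few pandas calls. The toy DataFrame below (pandas assumed available) has the quality problems deliberately baked in:

```python
import numpy as np
import pandas as pd

# Toy dataset with deliberate quality problems.
df = pd.DataFrame({
    "sqft":  [2000, 1500, np.nan, 1200, 99999],              # missing + implausible value
    "city":  ["NYC", "nyc", "New York", "Boston", "Boston"], # inconsistent encoding
    "price": [500_000, 350_000, 420_000, 280_000, 300_000],
})

# Completeness: count missing values per feature.
missing = df.isna().sum()

# Consistency: unique value counts expose encoding variants of the same city.
city_variants = df["city"].value_counts()

# Accuracy: flag values far outside any plausible range.
implausible = df[(df["sqft"] < 100) | (df["sqft"] > 20_000)]

print(int(missing["sqft"]), city_variants.size, len(implausible))  # 1 4 1
```

Here three spellings of the same city ("NYC", "nyc", "New York") would be treated as three unrelated categories by a model, which is exactly how inconsistent encoding breaks pattern learning.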
Data scientists often report spending 80% of their time on data preparation (cleaning, transforming, understanding) and only 20% on actual modeling. This ratio reflects the reality that data quality work, while unglamorous, drives most real-world ML success.
One of the most common questions in ML: 'How much data do I need?' Unfortunately, there's no simple answer—it depends on the complexity of the problem, the algorithm, and the desired performance level.
| Model Type | Minimum Samples | Recommended Samples | Notes |
|---|---|---|---|
| Linear Regression | 10 × features | 30-50 × features | Very data-efficient |
| Logistic Regression | 10 × features per class | 50 × features per class | Per-class requirement |
| Decision Trees | 100-1,000 total | 1,000-10,000 | Risk of overfitting with small data |
| Random Forests | 1,000+ | 10,000+ | Benefits from large datasets |
| Neural Networks (small) | 1,000 per class | 10,000+ per class | Scales with network size |
| Deep Learning (images) | 1,000 per class | 10,000+ per class | Transfer learning helps greatly |
| Large Language Models | Billions of tokens | Trillions of tokens | Massive scale required |
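The multipliers in the table can be turned into a quick back-of-the-envelope estimate. The helper below simply restates a few of the table's rules of thumb (it is a heuristic, not a rigorous sample-size bound):

```python
def min_samples(model_type, n_features=None, n_classes=None):
    """Back-of-the-envelope minimum sample sizes from common rules of thumb."""
    if model_type == "linear_regression":
        return 10 * n_features                # 10 x features
    if model_type == "logistic_regression":
        return 10 * n_features * n_classes    # 10 x features, per class
    if model_type == "random_forest":
        return 1_000                          # flat floor, benefits from more
    raise ValueError(f"no rule of thumb for {model_type}")

# A 20-feature regression problem needs roughly 200 samples at minimum.
print(min_samples("linear_regression", n_features=20))  # 200
```

Treat such estimates as a lower bound for a sanity check; noisy labels, class imbalance, and complex decision boundaries all push the real requirement higher.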
You now understand how data serves as the foundation of machine learning—from the structure of datasets through feature engineering, the i.i.d. assumption, data quality dimensions, and data volume requirements. Next, we'll compare ML to traditional programming to understand this paradigm shift.