If machine learning is the engine, data is the fuel. Every ML system, from simple linear regression to trillion-parameter language models, learns by extracting patterns from data. The quality, quantity, and structure of your data fundamentally determine what your model can learn and how well it will perform.
Unlike traditional programming where developers encode knowledge as rules, ML practitioners encode knowledge as data. The art of machine learning is largely the art of data—collecting it, cleaning it, representing it effectively, and understanding its limitations.
Andrew Ng's 'Data-Centric AI' movement emphasizes that improving data quality often yields better results than improving algorithms. In many practical applications, switching from a state-of-the-art model to a simpler one with better data produces superior outcomes.
To work with data effectively, we need a precise vocabulary. ML data is typically organized as a matrix where rows represent instances (examples, observations, samples) and columns represent features (attributes, variables, dimensions).
| Term | Definition | Example (House Price Dataset) |
|---|---|---|
| Instance (Sample) | A single observation or example in the dataset | One house with its characteristics |
| Feature (Attribute) | A measurable property of an instance | Square footage, number of bedrooms, year built |
| Label (Target) | The value we want to predict (in supervised learning) | Sale price of the house |
| Feature Vector | Collection of all features for one instance | [2000 sqft, 3 bedrooms, 1998, ...] |
| Dataset | Collection of instances with their features (and labels) | 10,000 houses with their characteristics and prices |
| Training Set | Portion of data used to train the model | 8,000 houses used for learning patterns |
| Test Set | Held-out data used to evaluate model performance | 2,000 houses never seen during training |
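The vocabulary above can be sketched in a few lines of code. The snippet below builds a synthetic house-price dataset (all values are made up for illustration; numpy is assumed available) and performs the 8,000/2,000 split from the table:

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances = 10_000

# Feature matrix X: one row per instance, one column per feature.
X = np.column_stack([
    rng.uniform(500, 5000, n_instances),     # square footage
    rng.integers(1, 6, n_instances),         # number of bedrooms
    rng.integers(1950, 2024, n_instances),   # year built
])
# Label vector y: the target we want to predict (sale price).
y = 150 * X[:, 0] + rng.normal(0, 10_000, n_instances)

# 80/20 train/test split: shuffle indices once, then slice.
idx = rng.permutation(n_instances)
train_idx, test_idx = idx[:8_000], idx[8_000:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8000, 3) (2000, 3)
```

Shuffling before slicing matters: if the rows were sorted (say, by price), a naive slice would give train and test sets with different distributions.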
Raw data rarely comes in a form suitable for ML algorithms. Feature engineering—the process of transforming raw data into informative features—is often the difference between a mediocre model and a great one. Before deep learning's rise, feature engineering was considered the most critical skill in ML.
Numerical (continuous) features represent measurable quantities on a continuous scale, such as square footage or price.

Common transformations:

- **Normalization (Min-Max Scaling):** Scale values to the [0, 1] range.
- **Standardization (Z-Score):** Transform to mean = 0, standard deviation = 1.
- **Log Transformation:** Compress large ranges and reduce skewness.
- **Binning/Discretization:** Convert continuous values into categorical buckets.
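Each of these transformations is a one-liner with numpy (assumed available); the feature values below are illustrative:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # a skewed numerical feature

# Normalization (min-max scaling): map values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()

# Log transformation: compress the large range; log1p handles zeros safely
x_log = np.log1p(x)

# Binning: discretize into 3 equal-width bins labeled 0, 1, 2
edges = np.linspace(x.min(), x.max(), 4)[1:-1]  # two interior bin edges
x_binned = np.digitize(x, edges)

print(x_binned)  # [0 0 0 0 2]
```

Note how equal-width binning handles skew poorly here: four of the five values land in the first bin. For skewed features, quantile-based bins are often the better choice.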
The best features capture domain knowledge. A data scientist who understands that 'time since last transaction' matters for fraud detection will build a better model than one who blindly throws raw timestamps at the algorithm. Deep learning has automated some feature engineering, but domain expertise remains invaluable.
Most ML theory rests on a fundamental assumption: that data points are Independent and Identically Distributed (i.i.d.). This assumption is so important that violating it can completely undermine your model's reliability.
**Independent:** Each data point is sampled independently; knowing one observation tells you nothing about the others.

Violation example: time series data, where today's stock price depends on yesterday's, or sequential user clicks, where one click influences the next.

**Identically Distributed:** All data points come from the same underlying probability distribution.

Violation example: training on summer data and testing on winter data, or training on US users and deploying globally.
Common pitfalls when the i.i.d. assumption fails:

- **Distribution Shift:** Training and deployment data differ.
- **Data Leakage:** Future information leaks into training.
- **Temporal Dependence:** Sequential data violates independence.
- **Selection Bias:** The sample isn't representative of the population.
Specialized techniques exist for non-i.i.d. scenarios: time series models, transfer learning, domain adaptation.
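A crude way to catch "identically distributed" violations before they bite is to compare a feature's distribution between training data and incoming deployment data. The sketch below uses a hypothetical mean-gap score (not a standard statistical test; numpy is assumed) on the summer-vs-winter temperature scenario:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated temperature feature: "summer" training data vs "winter" deployment data
train_temps = rng.normal(loc=25.0, scale=3.0, size=5_000)
deploy_temps = rng.normal(loc=2.0, scale=4.0, size=5_000)

def shift_score(a, b):
    """Gap between the two means, measured in pooled-standard-deviation units."""
    pooled_std = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_std

# Same-distribution samples score near 0; shifted samples score well above 1.
print(shift_score(train_temps, deploy_temps))
```

In practice, production systems use proper two-sample tests (e.g. Kolmogorov-Smirnov) or population-stability metrics, but the idea is the same: monitor whether deployment data still looks like training data.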
A hospital develops an ML model to predict patient readmission risk using 2019 data. The model achieves 92% accuracy and is deployed in January 2020. By March 2020, performance has dropped to 67% accuracy: the COVID-19 pandemic changed patient populations, treatment protocols, and readmission patterns, violating the 'identically distributed' assumption. This is called 'concept drift' or 'distribution shift.' The world changed, but the model didn't. Solutions include monitoring model performance in production, retraining regularly, and building models robust to distribution shift.
Not all data is created equal. Data quality directly impacts model quality. Understanding the dimensions of data quality helps diagnose problems and prioritize data improvement efforts.
| Dimension | Definition | ML Impact | Detection Methods |
|---|---|---|---|
| Accuracy | Do values correctly represent reality? | Incorrect labels directly teach wrong patterns | Manual auditing, cross-validation with other sources |
| Completeness | Are there missing values? | Missingness can be informative or biasing | Count nulls/NaNs per feature; analyze missing patterns |
| Consistency | Are values consistent across the dataset? | Inconsistent encoding breaks pattern learning | Unique value counts; format validation |
| Timeliness | Is the data current and relevant? | Stale data may not reflect current patterns | Check timestamps; compare with recent ground truth |
| Representativeness | Does the data represent the target population? | Biased training → biased predictions | Compare distributions across subgroups |
| Relevance | Are features informative for the task? | Irrelevant features add noise; hurt performance | Feature importance analysis; domain expertise |
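Several of the detection methods in the table map directly to a few pandas calls. The toy DataFrame below (pandas assumed available) has the quality problems deliberately baked in:

```python
import numpy as np
import pandas as pd

# Toy dataset with deliberate quality problems.
df = pd.DataFrame({
    "sqft":  [2000, 1500, np.nan, 1200, 99999],              # missing + implausible value
    "city":  ["NYC", "nyc", "New York", "Boston", "Boston"], # inconsistent encoding
    "price": [500_000, 350_000, 420_000, 280_000, 300_000],
})

# Completeness: count missing values per feature.
missing = df.isna().sum()

# Consistency: unique value counts expose encoding variants of the same city.
city_variants = df["city"].value_counts()

# Accuracy: flag values far outside any plausible range.
implausible = df[(df["sqft"] < 100) | (df["sqft"] > 20_000)]

print(int(missing["sqft"]), city_variants.size, len(implausible))  # 1 4 1
```

Here three spellings of the same city ("NYC", "nyc", "New York") would be treated as three unrelated categories by a model, which is exactly how inconsistent encoding breaks pattern learning.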
Data scientists often report spending 80% of their time on data preparation (cleaning, transforming, understanding) and only 20% on actual modeling. This ratio reflects the reality that data quality work, while unglamorous, drives most real-world ML success.
One of the most common questions in ML: 'How much data do I need?' Unfortunately, there's no simple answer—it depends on the complexity of the problem, the algorithm, and the desired performance level.
| Model Type | Minimum Samples | Recommended Samples | Notes |
|---|---|---|---|
| Linear Regression | 10 × features | 30-50 × features | Very data-efficient |
| Logistic Regression | 10 × features per class | 50 × features per class | Per-class requirement |
| Decision Trees | 100-1,000 total | 1,000-10,000 | Risk of overfitting with small data |
| Random Forests | 1,000+ | 10,000+ | Benefits from large datasets |
| Neural Networks (small) | 1,000 per class | 10,000+ per class | Scales with network size |
| Deep Learning (images) | 1,000 per class | 10,000+ per class | Transfer learning helps greatly |
| Large Language Models | Billions of tokens | Trillions of tokens | Massive scale required |
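The multipliers in the table can be turned into a quick back-of-the-envelope estimate. The helper below simply restates a few of the table's rules of thumb (it is a heuristic, not a rigorous sample-size bound):

```python
def min_samples(model_type, n_features=None, n_classes=None):
    """Back-of-the-envelope minimum sample sizes from common rules of thumb."""
    if model_type == "linear_regression":
        return 10 * n_features                # 10 x features
    if model_type == "logistic_regression":
        return 10 * n_features * n_classes    # 10 x features, per class
    if model_type == "random_forest":
        return 1_000                          # flat floor, benefits from more
    raise ValueError(f"no rule of thumb for {model_type}")

# A 20-feature regression problem needs roughly 200 samples at minimum.
print(min_samples("linear_regression", n_features=20))  # 200
```

Treat such estimates as a lower bound for a sanity check; noisy labels, class imbalance, and complex decision boundaries all push the real requirement higher.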
You now understand how data serves as the foundation of machine learning—from the structure of datasets through feature engineering, the i.i.d. assumption, data quality dimensions, and data volume requirements. Next, we'll compare ML to traditional programming to understand this paradigm shift.