When you understand that machine learning is about extracting patterns from data, an immediate question arises: what kind of data, and what kind of patterns?
The structure of available data—particularly whether outputs are provided—fundamentally determines what can be learned and how. This leads to the primary taxonomy of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
These aren't just different algorithms—they're different problem formulations suited to different kinds of tasks, data availabilities, and learning objectives. Mastering ML requires understanding when each paradigm applies and what each can achieve.
This page provides a deep exploration of each learning paradigm. You'll understand the formal problem setup, the key algorithms, the canonical applications, and the practical considerations for each. By the end, you'll be able to recognize which paradigm fits any given machine learning problem.
Supervised learning is the most mature and widely deployed branch of machine learning. The name derives from the 'supervisor' who provides correct answers during training.
Given: a training set of labeled examples (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), where each input xᵢ ∈ X is paired with a label yᵢ ∈ Y provided by a 'supervisor.'
Goal: Learn a function h: X → Y that predicts the label for new, unseen inputs.
Core assumption: The training labels are correct (or at least mostly correct). The 'supervisor' has provided ground truth that the learner should match.
The two main supervised learning tasks are:
Classification: Y is a finite discrete set (categories/classes). The goal is to assign inputs to the correct category.
Regression: Y = ℝ (or ℝⁿ). The goal is to predict a continuous numerical value.
In classification, we learn to assign inputs to one of several predefined categories.
Binary classification: Two classes (e.g., spam/not spam, positive/negative, fraud/legitimate). The output is often interpreted as P(y=1|x)—the probability that input x belongs to class 1.
Multi-class classification: More than two mutually exclusive classes (e.g., digit recognition with 10 classes 0-9, language identification with 100+ languages). Each input belongs to exactly one class.
Multi-label classification: Each input can belong to multiple classes simultaneously (e.g., an image tagged with 'beach,' 'sunset,' and 'people'). Classes are not mutually exclusive.
What makes classification hard? Classes can overlap in feature space, labels can be noisy, some classes may be rare (class imbalance), and in high-dimensional inputs the boundary between classes can be extremely complex.
Canonical algorithms: Logistic regression, decision trees, random forests, support vector machines, neural networks.
Example applications: Email spam filtering, medical diagnosis, image classification, sentiment analysis, fraud detection.
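To make this concrete, here is a minimal sketch of binary classification with scikit-learn, assuming it is installed; the synthetic dataset and all parameter choices are illustrative rather than tied to any application above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic labeled data: 1,000 examples, 20 features, two classes (0 and 1).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set to estimate performance on new, unseen inputs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit h: X -> Y on the labeled training examples.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# The model outputs P(y=1|x); thresholding at 0.5 gives the predicted class.
probabilities = clf.predict_proba(X_test)[:, 1]
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```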
In regression, we learn to predict continuous numerical values.
Single-output regression: Predict a single value (e.g., house price, temperature tomorrow).
Multi-output regression: Predict multiple values simultaneously (e.g., predict both latitude and longitude, or all parameters of a physical system).
What makes regression hard? The input-output relationship may be highly nonlinear, noise levels can vary across the input space, outliers can distort the fit, and predictions may need to extrapolate beyond the range of the training data.
Canonical algorithms: Linear regression, polynomial regression, ridge/lasso regression, decision tree regressors, neural networks, Gaussian processes.
Example applications: Stock price prediction, demand forecasting, housing valuation, physical parameter estimation, age estimation from photos.
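The regression workflow looks almost identical; only the output type and the error measure change. A sketch on synthetic data, again with arbitrary parameter choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a continuous target generated from a noisy linear function.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge regression: linear regression with L2 regularization.
reg = Ridge(alpha=1.0).fit(X_train, y_train)

# Error is a distance between predicted and true values, not a match/mismatch.
print("test MSE:", mean_squared_error(y_test, reg.predict(X_test)))
```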
Supervised learning works because labels encode human knowledge. The supervisor has solved the problem—recognizing cats, diagnosing diseases, valuing houses—and transferred that knowledge through labeled examples. The algorithm generalizes this knowledge to new cases.
The biggest limitation of supervised learning is the need for labeled data. High-quality labels require human effort—often expert human effort for specialized domains. Modern research on semi-supervised learning, weak supervision, and self-supervised learning addresses this bottleneck.
Unsupervised learning operates without labels. The 'supervisor' is absent—there are no correct answers provided. The algorithm must discover structure in the data itself.
Given: a dataset of unlabeled examples x₁, x₂, ..., xₙ. Only the inputs are observed; no target outputs are provided.
Goal: Discover meaningful structure, patterns, or representations in the data.
But what counts as 'structure'? Unlike supervised learning, there's no single objective—different unsupervised methods seek different kinds of structure:
Clustering partitions data into groups where points within a group are more similar to each other than to points in other groups.
Hard clustering: Each point belongs to exactly one cluster (e.g., k-means, DBSCAN).
Soft clustering: Each point has a probability of belonging to each cluster (e.g., Gaussian mixture models, fuzzy c-means).
Hierarchical clustering: Builds a tree of clusters from fine-grained to coarse (e.g., agglomerative clustering, divisive methods).
What makes clustering hard? There is no ground truth: the right number of clusters, the right notion of similarity, and the right granularity all depend on the goal, and real clusters often have irregular shapes and densities.
Canonical algorithms: K-means, hierarchical clustering, DBSCAN, Gaussian mixture models, spectral clustering.
Example applications: Customer segmentation, gene expression analysis, document organization, image segmentation.
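A brief sketch contrasting hard and soft clustering on synthetic blobs; the number of clusters is chosen by hand here, which in real problems is itself part of the difficulty.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Unlabeled data: 500 points drawn from 3 blobs (the algorithms never see the blob ids).
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Hard clustering: each point is assigned to exactly one cluster.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability of membership in each cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_probabilities = gmm.predict_proba(X)          # shape (500, 3); rows sum to 1

print("cluster sizes:", np.bincount(hard_labels))
print("memberships of first point:", soft_probabilities[0])
```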
Dimensionality reduction finds lower-dimensional representations that preserve essential structure.
Linear methods: Project onto linear subspaces (PCA, factor analysis).
Nonlinear methods: Capture curved manifold structure (t-SNE, UMAP, autoencoders).
Why reduce dimensionality? High-dimensional data is hard to visualize, expensive to process, and often dominated by redundant or noisy features; a compact representation can reveal structure, speed up downstream models, and serve as compression.
Canonical algorithms: PCA, t-SNE, UMAP, autoencoders, ICA.
Example applications: Visualization of high-dimensional data, preprocessing for classifiers, compression, face recognition (eigenfaces).
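A sketch of linear dimensionality reduction with PCA on the classic digits dataset; keeping two components is an arbitrary choice made for visualization.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 digit images flattened into 64-dimensional vectors.
X, _ = load_digits(return_X_y=True)

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("original shape:", X.shape)    # (1797, 64)
print("reduced shape:", X_2d.shape)  # (1797, 2)
print("fraction of variance kept:", pca.explained_variance_ratio_.sum())
```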
Density estimation learns the probability distribution P(X) that generated the data.
Why estimate density? A model of P(X) lets you score how typical a new point is (anomaly detection), reason about missing values, and generate new samples.
Generative models can generate new samples from the learned distribution. Modern deep generative models—VAEs, GANs, diffusion models—can generate realistic images, text, audio, and more.
Canonical algorithms: Kernel density estimation, Gaussian mixture models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion models.
Example applications: Image generation, text generation, data augmentation, anomaly detection, drug discovery.
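A small sketch of density estimation with a Gaussian mixture model, showing the two uses discussed above: scoring how typical a point is and sampling new points. Deep generative models apply the same idea at far larger scale; the data and test points here are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Fit an estimate of P(X) to unlabeled two-dimensional data.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
density = GaussianMixture(n_components=3, random_state=0).fit(X)

# Anomaly detection: points with low log-likelihood under P(X) are flagged as unusual.
test_points = np.array([[0.0, 0.0], [100.0, 100.0]])
print("log P(x):", density.score_samples(test_points))   # the second point scores far lower

# Generation: draw brand-new samples from the learned distribution.
new_points, _ = density.sample(5)
print(new_points)
```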
Without labels, evaluation is difficult. 'How well did we cluster?' or 'How good is this representation?' often lacks a clear answer. Unsupervised learning is more open-ended—discovery rather than prediction.
Self-supervised learning blurs the boundary. It creates 'pseudo-labels' from the data itself—for example, predicting masked words in text or rotated image orientations. Technically unsupervised (no human labels), it uses supervised-like objectives. This hybrid has driven recent breakthroughs in NLP (BERT, GPT) and computer vision.
Reinforcement learning (RL) is fundamentally different from both supervised and unsupervised learning. Instead of learning from a static dataset, an RL agent learns through interaction with an environment, receiving rewards as feedback.
Agent: The learner and decision-maker.
Environment: Everything outside the agent—the world it acts in.
State (s): A representation of the current situation.
Action (a): A choice the agent makes.
Reward (r): A scalar signal indicating how good the outcome was.
Policy (π): The agent's strategy—a mapping from states to actions.
The RL Loop: the agent observes the current state, selects an action according to its policy, receives a reward, and the environment transitions to a new state. The cycle repeats, and the agent learns from the resulting stream of (s, a, r, s') experience.
Goal: Learn a policy π* that maximizes cumulative reward over time.
Sequential decision making: Actions have consequences over time. A good move in chess might pay off 30 moves later. The agent must consider long-term outcomes, not just immediate rewards.
Exploration vs. exploitation: Should the agent do what it knows works well (exploit) or try something new that might be better (explore)? This tradeoff is fundamental and has no analogue in supervised/unsupervised learning.
Credit assignment: When a reward finally arrives, which past actions deserve credit? If you win a game after 100 moves, which moves were brilliant and which were lucky?
No supervisor, only feedback: The agent never sees 'correct' actions—only rewards indicating how well it did. It must discover good behavior through trial and error.
Non-stationary data: The data distribution changes as the agent's policy changes. Learning affects what data is collected, creating complex dynamics.
Value function V(s): Expected cumulative reward from state s following policy π. 'How good is it to be in this state?'
Q-function Q(s, a): Expected cumulative reward from state s, taking action a, then following π. 'How good is it to take this action in this state?'
Discount factor γ (gamma): How much to discount future rewards (0 ≤ γ < 1). Lower γ makes the agent more myopic; higher γ makes it more far-sighted.
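To ground these quantities, here is a minimal sketch of tabular Q-learning (one of the canonical algorithms listed below) on a toy chain environment invented purely for illustration. The update moves Q(s, a) toward r + γ · max over a' of Q(s', a').

```python
import random

# Toy environment: states 0..4 form a chain; action 0 moves left, action 1 moves right.
# Reaching state 4 yields reward +1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def greedy(q_row):
    best = max(q_row)
    return random.choice([a for a, v in enumerate(q_row) if v == best])  # break ties randomly

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # Q-table, initialized to zero
alpha, gamma, epsilon = 0.1, 0.9, 0.1              # learning rate, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit current Q.
        action = random.randrange(N_ACTIONS) if random.random() < epsilon else greedy(Q[state])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print(Q)  # Q-values should end up favoring action 1 (move right) in every state
```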
Model-based vs. model-free: model-based methods learn (or are given) a model of the environment's dynamics and use it to plan ahead; model-free methods learn value functions or policies directly from experience without modeling the environment.
On-policy vs. off-policy: on-policy methods (e.g., SARSA) evaluate and improve the same policy that is collecting the experience; off-policy methods (e.g., Q-learning) learn about a target policy from data generated by a different behavior policy.
Canonical algorithms: Q-learning, SARSA, policy gradient methods, actor-critic, PPO, SAC, Monte Carlo Tree Search, AlphaZero.
Example applications: Game playing (Go, Chess, Atari), robotics (manipulation, locomotion), autonomous driving, resource management, recommendation systems.
Reinforcement learning is notoriously challenging. Sample efficiency is often poor (millions of interactions needed), training is unstable, hyperparameters are sensitive, and debugging is difficult. The spectacular successes (AlphaGo, game-playing AIs) required enormous compute. Real-world deployment remains limited compared to supervised learning.
Let's systematically compare supervised, unsupervised, and reinforcement learning across key dimensions.
| Dimension | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Data | Labeled examples (x, y) | Unlabeled examples (x) | Interactions (s, a, r, s') |
| Feedback | Correct answer provided | No feedback | Reward signal (delayed, sparse) |
| Goal | Predict y from x | Find structure in X | Maximize cumulative reward |
| Evaluation | Compare to ground truth | Subjective/task-dependent | Total reward obtained |
| Prototypical task | Classification, regression | Clustering, dimensionality reduction | Control, game-playing |
| Data source | Static dataset | Static dataset | Interactive experience |
| Algorithms | SVM, trees, neural nets | K-means, PCA, GMM | Q-learning, policy gradient |
Use supervised learning when: you have (or can affordably obtain) labeled examples and the goal is to predict a known kind of output for new inputs.
Use unsupervised learning when: you have unlabeled data and want to discover structure in it: groups, low-dimensional representations, or the underlying distribution.
Use reinforcement learning when: the problem involves sequential decisions, you can interact with an environment (or a faithful simulator), and success can be expressed as a reward signal.
In practice, the vast majority of deployed ML systems use supervised learning. Its problem formulation is clearest, evaluation is most straightforward, and algorithms are most mature. Unsupervised methods often serve as preprocessing for supervised tasks. RL remains the most difficult to deploy successfully.
The supervised/unsupervised/reinforcement taxonomy is foundational but incomplete. Modern ML includes several important variations and hybrids.
Setting: Many unlabeled examples, few labeled examples.
Idea: Use the structure in unlabeled data to improve learning from limited labels. If unlabeled examples show that certain regions of input space are dense or clustered, this helps decide where decision boundaries should go.
Techniques: Self-training, co-training, consistency regularization, graph-based methods.
Why it matters: Labels are expensive; unlabeled data is cheap. Semi-supervised learning leverages abundant unlabeled data to reduce label requirements.
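A sketch of the self-training idea using scikit-learn's SelfTrainingClassifier, assuming a recent scikit-learn; the 50/950 labeled/unlabeled split and the confidence threshold are arbitrary, and in practice you would evaluate on a separate held-out labeled set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

# 1,000 examples, but pretend only 50 are labeled; unlabeled points are marked with -1.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
y_partial = np.full_like(y, -1)
labeled_idx = np.random.RandomState(0).choice(len(y), size=50, replace=False)
y_partial[labeled_idx] = y[labeled_idx]

# Self-training: fit on labeled points, pseudo-label confident unlabeled points, refit, repeat.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

print("accuracy against the held-back labels:", accuracy_score(y, model.predict(X)))
```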
Setting: Unlabeled data, but create pseudo-labels from the data itself.
Examples: predicting masked words in a sentence (as in BERT), predicting the next word in a sequence (as in GPT), and predicting the rotation that was applied to an image.
Why it matters: Self-supervised pretraining on massive unlabeled datasets produces representations that transfer remarkably well to downstream tasks. This is the foundation of modern NLP (GPT, BERT) and increasingly computer vision.
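Self-supervision is less about any particular model than about how labels are manufactured from raw data. A tiny sketch of turning unlabeled text into next-word prediction pairs; the sentence and window size are arbitrary.

```python
# Build (context, next word) training pairs from raw text with no human labeling.
text = "machine learning is about extracting patterns from data"
words = text.split()

context_size = 3
pairs = []
for i in range(len(words) - context_size):
    context = words[i:i + context_size]    # pseudo-input: a window of consecutive words
    target = words[i + context_size]       # pseudo-label: the word that follows the window
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# Any supervised learner (in practice a large neural network) can be trained on these pairs.
```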
Setting: Leverage knowledge from one task/domain to improve learning on another.
Approach: Pretrain a model on a large dataset (source), then fine-tune on a smaller target dataset.
Why it matters: When target data is limited, transfer learning can dramatically improve performance by reusing representations learned from abundant source data. ImageNet pretraining revolutionized computer vision; LLM pretraining revolutionized NLP.
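A minimal fine-tuning sketch, assuming PyTorch and torchvision are available; the 5-class head and the random stand-in batch are placeholders for a real target dataset, and loading the pretrained weights downloads them on first use.

```python
import torch
import torch.nn as nn
from torchvision import models

# Source knowledge: a ResNet-18 pretrained on ImageNet.
model = models.resnet18(weights="DEFAULT")

# Freeze the pretrained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh head for the smaller target task (here, 5 classes).
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are optimized; everything else is reused as-is.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# From here, training proceeds as ordinary supervised learning on the target data.
dummy_images = torch.randn(8, 3, 224, 224)   # stand-in for a batch of target-task images
print(model(dummy_images).shape)             # torch.Size([8, 5])
```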
Setting: Learn multiple related tasks simultaneously.
Idea: Shared representations benefit all tasks. What's learned for one task provides inductive bias for others.
Why it matters: More efficient use of data and compute. Performance gains from positive transfer between tasks.
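A sketch of the shared-representation idea in PyTorch; the network sizes and the pairing of a classification task with a regression task are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """One shared trunk feeding two task-specific heads."""
    def __init__(self, in_dim=32, hidden=64, n_classes=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, n_classes)   # task A: classification
        self.reg_head = nn.Linear(hidden, 1)           # task B: regression

    def forward(self, x):
        h = self.shared(x)                  # representation shared across tasks
        return self.cls_head(h), self.reg_head(h)

model = MultiTaskNet()
x = torch.randn(16, 32)
cls_labels = torch.randint(0, 3, (16,))
reg_targets = torch.randn(16, 1)

logits, values = model(x)
# Combined loss: gradients from both tasks shape the shared representation.
loss = F.cross_entropy(logits, cls_labels) + F.mse_loss(values, reg_targets)
loss.backward()
print("combined loss:", float(loss))
```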
Setting: The algorithm can request labels for specific examples.
Idea: Choose examples that will be most informative to label—typically examples where the model is uncertain. Achieve good performance with fewer labels.
Why it matters: Labels are expensive. Smart selection of what to label can dramatically reduce annotation costs.
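A sketch of uncertainty sampling, the simplest active-learning query strategy; the pool sizes, number of rounds, and query batch size are arbitrary, and the 'oracle' here just reveals labels we held back.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Start with a tiny labeled pool; everything else sits in an unlabeled pool.
rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X), size=20, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for round_number in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # Uncertainty sampling: query the points whose predicted probability is closest to 0.5.
    probs = clf.predict_proba(X[unlabeled])[:, 1]
    most_uncertain = np.argsort(np.abs(probs - 0.5))[:10]
    queries = [unlabeled[i] for i in most_uncertain]

    # Ask the oracle for labels (here, simply reveal the held-back true labels).
    labeled.extend(queries)
    unlabeled = [i for i in unlabeled if i not in queries]
    print(f"round {round_number}: {len(labeled)} labels, accuracy {clf.score(X, y):.3f}")
```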
Setting: Learn from expert demonstrations rather than reward signals.
Relationship to RL: Like RL (sequential decision-making) but with supervised-like feedback (expert actions).
Techniques: Behavioral cloning, inverse reinforcement learning, GAIL.
Why it matters: Often easier than RL because demonstrations provide direct supervision. Used in robotics (teach by demonstration) and game AI.
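A sketch of behavioral cloning, the simplest imitation technique: treat the expert's (state, action) pairs as an ordinary supervised dataset. The one-dimensional 'expert' below is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Expert demonstrations: (state, action) pairs recorded from a hypothetical expert
# whose rule is "move right (action 1) whenever the position is negative".
rng = np.random.RandomState(0)
states = rng.uniform(-1.0, 1.0, size=(500, 1))
expert_actions = (states[:, 0] < 0).astype(int)

# Behavioral cloning: plain supervised classification of the expert's action given the state.
policy = RandomForestClassifier(n_estimators=50, random_state=0)
policy.fit(states, expert_actions)

# The cloned policy maps new states to actions without ever seeing a reward signal.
print(policy.predict([[-0.3], [0.7]]))   # expected: [1 0]
```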
Setting: Data arrives sequentially; the model must learn and predict in an ongoing fashion.
Contrast with batch learning: Traditional ML assumes a fixed dataset. Online learning handles streams and concept drift.
Why it matters: Many real-world applications (recommendations, fraud detection) operate on data streams where batch assumptions fail.
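A sketch of incremental (online) learning with scikit-learn's partial_fit interface, assuming a recent scikit-learn; the simulated stream and batch size are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# A linear classifier updated incrementally as mini-batches arrive from a stream.
clf = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])            # all possible classes must be declared up front

rng = np.random.RandomState(0)
for batch in range(100):
    # Simulated stream: each batch is a small chunk of (features, label) pairs.
    X_batch = rng.randn(32, 10)
    y_batch = (X_batch[:, 0] + 0.1 * rng.randn(32) > 0).astype(int)

    clf.partial_fit(X_batch, y_batch, classes=classes)   # one incremental update per batch

# The model can make predictions at any point in the stream, not just after a full pass.
print(clf.predict(rng.randn(3, 10)))
```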
Modern ML increasingly combines elements from multiple paradigms. A large language model might use self-supervised pretraining (unsupervised-like), supervised fine-tuning, and RLHF (reinforcement learning from human feedback). Real systems are often hybrids that defy simple categorization.
Data is central to all machine learning, but its role differs across paradigms.
In supervised learning, labeled data encodes human knowledge about the task. Each (x, y) pair says 'for this input, the correct output is this.' The quality, quantity, and representativeness of the labels determine what can be learned.
Data challenges: labels are expensive to acquire (often requiring experts), label noise and annotator disagreement degrade quality, classes may be imbalanced, and a mismatch between the training distribution and the deployment distribution hurts performance.
In unsupervised learning, the data is all there is. Structure must emerge from the data's own regularities: clusters, manifolds, distributional patterns. There is no external signal saying which structure matters.
Data challenges: deciding what structure matters and how to measure similarity, coping with noise and outliers, and evaluating results without any ground truth.
In reinforcement learning, data isn't given; it's generated through interaction. The agent's policy determines which states it visits and what data it collects. This creates a circular dependency: learning requires data, but data quality depends on learning.
Data challenges: poor sample efficiency (enormous numbers of interactions may be needed), specifying a reward that captures what you actually want, and the gap between simulators and the real world.
| Aspect | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Data type | Labeled examples | Unlabeled examples | Trajectories (s, a, r, s') |
| Key bottleneck | Label acquisition cost | Defining what structure matters | Sample efficiency |
| Typical quantity | 1K - 1M examples | 10K - 10B examples | 1M - 1B interactions |
| Data quality | Label accuracy critical | Noise tolerance varies | Reward specification critical |
| Distribution shift | Train/test mismatch hurts | Less directly impactful | Sim-to-real gap problematic |
In all paradigms, data is often the practical bottleneck. Algorithmic improvements are easier than data improvements. The best algorithm on poor data usually loses to a decent algorithm on excellent data. Invest in data quality, quantity, and curation.
Let's practice recognizing which paradigm fits various real-world problems.
Medical diagnosis from images: Given X-ray/MRI/CT images labeled by radiologists as 'cancer' or 'no cancer,' train a classifier. This is binary classification—prototypically supervised.
Speech-to-text transcription: Audio recordings paired with accurate transcripts. The model learns to map audio features to text. Supervised learning with structured output.
Predicting customer churn: Historical data on customers who stayed vs. left, with features like usage patterns, tenure, complaints. Binary classification.
Customer segmentation: You have customer data but no predefined segments. Use clustering to discover natural groupings for targeted marketing.
Anomaly detection in network traffic: Normal traffic is abundant; attacks are rare and varied. Learn the distribution of normal traffic; flag outliers.
Topic modeling on documents: Given many documents, discover latent topics without predefined categories. What themes emerge from the data?
Game playing (Go, Chess, Atari): The agent takes actions (moves), observes outcomes (board states), and receives rewards (win/lose/score). Sequential decision-making with delayed rewards.
Robot control: A robot arm must learn to grasp objects. It takes motor commands, observes results, and receives reward for successful grasps. Physical interaction with environment.
Resource allocation in data centers: Allocate compute resources to jobs to maximize throughput. Decisions have sequential consequences; reward is overall efficiency.
Recommendation systems: Often framed as supervised (predict ratings) but can be RL (maximize long-term engagement). The right framing depends on objectives and data.
Autonomous driving: Perception (object detection) is supervised. Planning (what maneuver to make) could be supervised (imitation learning) or RL (optimize driving policy).
Text generation: Modern LLMs use self-supervised pretraining (predict next word), supervised fine-tuning (instruction following), and RLHF—all three paradigms!
Drug discovery: Predicting molecular properties is supervised. Generating new molecules might be unsupervised (generative models) or RL (optimize for desired properties).
Many problems can be formulated in multiple ways. The 'right' paradigm depends on available data, practical constraints, and what you ultimately care about. Choosing the formulation is a critical modeling decision.
We've mapped the terrain of machine learning paradigms. Let's consolidate:
Supervised learning: learn a mapping from inputs to provided labels (classification, regression); the most mature and widely deployed paradigm, bottlenecked by label acquisition.
Unsupervised learning: discover structure in unlabeled data through clustering, dimensionality reduction, and density estimation; evaluation is more open-ended.
Reinforcement learning: learn a policy through interaction with an environment, balancing exploration and exploitation to maximize cumulative reward; powerful but hard to deploy.
Beyond the basic taxonomy: semi-supervised, self-supervised, transfer, multi-task, active, imitation, and online learning extend and combine these paradigms, and modern systems are often hybrids.
Across all paradigms, data quality, quantity, and curation are usually the practical bottleneck.
What's next:
We've covered the fundamental definitions, learning from data concepts, the contrast with traditional programming, and the major learning paradigms. The final page of this module explores the historical evolution of machine learning—the intellectual journey from early ideas to modern practice, and how understanding this history illuminates where the field is headed.
You now have a comprehensive map of machine learning paradigms. When facing a new problem, you can recognize which paradigm applies, understand the data requirements, and anticipate the challenges. This taxonomic clarity is essential for effective ML practice.