When you understand that machine learning is about extracting patterns from data, an immediate question arises: what kind of data, and what kind of patterns?
The structure of available data—particularly whether outputs are provided—fundamentally determines what can be learned and how. This leads to the primary taxonomy of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
These aren't just different algorithms—they're different problem formulations suited to different kinds of tasks, data availabilities, and learning objectives. Mastering ML requires understanding when each paradigm applies and what each can achieve.
This page provides a deep exploration of each learning paradigm. You'll understand the formal problem setup, the key algorithms, the canonical applications, and the practical considerations for each. By the end, you'll be able to recognize which paradigm fits any given machine learning problem.
Supervised learning is the most mature and widely deployed branch of machine learning. The name derives from the 'supervisor' who provides correct answers during training.
Given: a training set of labeled examples (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), where each input xᵢ ∈ X is paired with a label yᵢ ∈ Y provided by a 'supervisor.'
Goal: Learn a function h: X → Y that predicts the label for new, unseen inputs.
Core assumption: The training labels are correct (or at least mostly correct). The 'supervisor' has provided ground truth that the learner should match.
The two main supervised learning tasks are:
Classification: Y is a finite discrete set (categories/classes). The goal is to assign inputs to the correct category.
Regression: Y = ℝ (or ℝⁿ). The goal is to predict a continuous numerical value.
In classification, we learn to assign inputs to one of several predefined categories.
Binary classification: Two classes (e.g., spam/not spam, positive/negative, fraud/legitimate). The output is often interpreted as P(y=1|x)—the probability that input x belongs to class 1.
Multi-class classification: More than two mutually exclusive classes (e.g., digit recognition with 10 classes 0-9, language identification with 100+ languages). Each input belongs to exactly one class.
Multi-label classification: Each input can belong to multiple classes simultaneously (e.g., an image tagged with 'beach,' 'sunset,' and 'people'). Classes are not mutually exclusive.
What makes classification hard? Classes can overlap in feature space, labels can be noisy, some classes may be rare (class imbalance), and in high-dimensional inputs the boundary between classes can be extremely complex.
Canonical algorithms: Logistic regression, decision trees, random forests, support vector machines, neural networks.
Example applications: Email spam filtering, medical diagnosis, image classification, sentiment analysis, fraud detection.
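To make this concrete, here is a minimal sketch of binary classification with scikit-learn, assuming it is installed; the synthetic dataset and all parameter choices are illustrative rather than tied to any application above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic labeled data: 1,000 examples, 20 features, two classes (0 and 1).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set to estimate performance on new, unseen inputs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit h: X -> Y on the labeled training examples.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# The model outputs P(y=1|x); thresholding at 0.5 gives the predicted class.
probabilities = clf.predict_proba(X_test)[:, 1]
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```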
In regression, we learn to predict continuous numerical values.
Single-output regression: Predict a single value (e.g., house price, temperature tomorrow).
Multi-output regression: Predict multiple values simultaneously (e.g., predict both latitude and longitude, or all parameters of a physical system).
What makes regression hard? The input-output relationship may be highly nonlinear, noise levels can vary across the input space, outliers can distort the fit, and predictions may need to extrapolate beyond the range of the training data.
Canonical algorithms: Linear regression, polynomial regression, ridge/lasso regression, decision tree regressors, neural networks, Gaussian processes.
Example applications: Stock price prediction, demand forecasting, housing valuation, physical parameter estimation, age estimation from photos.
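The regression workflow looks almost identical; only the output type and the error measure change. A sketch on synthetic data, again with arbitrary parameter choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a continuous target generated from a noisy linear function.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge regression: linear regression with L2 regularization.
reg = Ridge(alpha=1.0).fit(X_train, y_train)

# Error is a distance between predicted and true values, not a match/mismatch.
print("test MSE:", mean_squared_error(y_test, reg.predict(X_test)))
```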
Supervised learning works because labels encode human knowledge. The supervisor has solved the problem—recognizing cats, diagnosing diseases, valuing houses—and transferred that knowledge through labeled examples. The algorithm generalizes this knowledge to new cases.
The biggest limitation of supervised learning is the need for labeled data. High-quality labels require human effort—often expert human effort for specialized domains. Modern research on semi-supervised learning, weak supervision, and self-supervised learning addresses this bottleneck.
Unsupervised learning operates without labels. The 'supervisor' is absent—there are no correct answers provided. The algorithm must discover structure in the data itself.
Given: a dataset of unlabeled examples x₁, x₂, ..., xₙ. Only the inputs are observed; no target outputs are provided.
Goal: Discover meaningful structure, patterns, or representations in the data.
But what counts as 'structure'? Unlike supervised learning, there's no single objective—different unsupervised methods seek different kinds of structure:
Clustering partitions data into groups where points within a group are more similar to each other than to points in other groups.
Hard clustering: Each point belongs to exactly one cluster (e.g., k-means, DBSCAN).
Soft clustering: Each point has a probability of belonging to each cluster (e.g., Gaussian mixture models, fuzzy c-means).
Hierarchical clustering: Builds a tree of clusters from fine-grained to coarse (e.g., agglomerative clustering, divisive methods).
What makes clustering hard? There is no ground truth: the right number of clusters, the right notion of similarity, and the right granularity all depend on the goal, and real clusters often have irregular shapes and densities.
Canonical algorithms: K-means, hierarchical clustering, DBSCAN, Gaussian mixture models, spectral clustering.
Example applications: Customer segmentation, gene expression analysis, document organization, image segmentation.
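A brief sketch contrasting hard and soft clustering on synthetic blobs; the number of clusters is chosen by hand here, which in real problems is itself part of the difficulty.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Unlabeled data: 500 points drawn from 3 blobs (the algorithms never see the blob ids).
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Hard clustering: each point is assigned to exactly one cluster.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability of membership in each cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_probabilities = gmm.predict_proba(X)          # shape (500, 3); rows sum to 1

print("cluster sizes:", np.bincount(hard_labels))
print("memberships of first point:", soft_probabilities[0])
```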
Dimensionality reduction finds lower-dimensional representations that preserve essential structure.
Linear methods: Project onto linear subspaces (PCA, factor analysis).
Nonlinear methods: Capture curved manifold structure (t-SNE, UMAP, autoencoders).
Why reduce dimensionality? High-dimensional data is hard to visualize, expensive to process, and often dominated by redundant or noisy features; a compact representation can reveal structure, speed up downstream models, and serve as compression.
Canonical algorithms: PCA, t-SNE, UMAP, autoencoders, ICA.
Example applications: Visualization of high-dimensional data, preprocessing for classifiers, compression, face recognition (eigenfaces).
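A sketch of linear dimensionality reduction with PCA on the classic digits dataset; keeping two components is an arbitrary choice made for visualization.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 digit images flattened into 64-dimensional vectors.
X, _ = load_digits(return_X_y=True)

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("original shape:", X.shape)    # (1797, 64)
print("reduced shape:", X_2d.shape)  # (1797, 2)
print("fraction of variance kept:", pca.explained_variance_ratio_.sum())
```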
Density estimation learns the probability distribution P(X) that generated the data.
Why estimate density? A model of P(X) lets you score how typical a new point is (anomaly detection), reason about missing values, and generate new samples.
Generative models can generate new samples from the learned distribution. Modern deep generative models—VAEs, GANs, diffusion models—can generate realistic images, text, audio, and more.
Canonical algorithms: Kernel density estimation, Gaussian mixture models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion models.
Example applications: Image generation, text generation, data augmentation, anomaly detection, drug discovery.
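A small sketch of density estimation with a Gaussian mixture model, showing the two uses discussed above: scoring how typical a point is and sampling new points. Deep generative models apply the same idea at far larger scale; the data and test points here are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Fit an estimate of P(X) to unlabeled two-dimensional data.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
density = GaussianMixture(n_components=3, random_state=0).fit(X)

# Anomaly detection: points with low log-likelihood under P(X) are flagged as unusual.
test_points = np.array([[0.0, 0.0], [100.0, 100.0]])
print("log P(x):", density.score_samples(test_points))   # the second point scores far lower

# Generation: draw brand-new samples from the learned distribution.
new_points, _ = density.sample(5)
print(new_points)
```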
Without labels, evaluation is difficult. 'How well did we cluster?' or 'How good is this representation?' often lacks a clear answer. Unsupervised learning is more open-ended—discovery rather than prediction.
Self-supervised learning blurs the boundary. It creates 'pseudo-labels' from the data itself—for example, predicting masked words in text or rotated image orientations. Technically unsupervised (no human labels), it uses supervised-like objectives. This hybrid has driven recent breakthroughs in NLP (BERT, GPT) and computer vision.
Reinforcement learning (RL) is fundamentally different from both supervised and unsupervised learning. Instead of learning from a static dataset, an RL agent learns through interaction with an environment, receiving rewards as feedback.
Agent: The learner and decision-maker.
Environment: Everything outside the agent—the world it acts in.
State (s): A representation of the current situation.
Action (a): A choice the agent makes.
Reward (r): A scalar signal indicating how good the outcome was.
Policy (π): The agent's strategy—a mapping from states to actions.
The RL Loop: the agent observes the current state, selects an action according to its policy, receives a reward, and the environment transitions to a new state. The cycle repeats, and the agent learns from the resulting stream of (s, a, r, s') experience.
Goal: Learn a policy π* that maximizes cumulative reward over time.
Sequential decision making: Actions have consequences over time. A good move in chess might pay off 30 moves later. The agent must consider long-term outcomes, not just immediate rewards.
Exploration vs. exploitation: Should the agent do what it knows works well (exploit) or try something new that might be better (explore)? This tradeoff is fundamental and has no analogue in supervised/unsupervised learning.
Credit assignment: When a reward finally arrives, which past actions deserve credit? If you win a game after 100 moves, which moves were brilliant and which were lucky?
No supervisor, only feedback: The agent never sees 'correct' actions—only rewards indicating how well it did. It must discover good behavior through trial and error.
Non-stationary data: The data distribution changes as the agent's policy changes. Learning affects what data is collected, creating complex dynamics.
Value function V(s): Expected cumulative reward from state s following policy π. 'How good is it to be in this state?'
Q-function Q(s, a): Expected cumulative reward from state s, taking action a, then following π. 'How good is it to take this action in this state?'
Discount factor γ (gamma): How much to discount future rewards (0 ≤ γ < 1). Lower γ makes the agent more myopic; higher γ makes it more far-sighted.
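To ground these quantities, here is a minimal sketch of tabular Q-learning (one of the canonical algorithms listed below) on a toy chain environment invented purely for illustration. The update moves Q(s, a) toward r + γ · max over a' of Q(s', a').

```python
import random

# Toy environment: states 0..4 form a chain; action 0 moves left, action 1 moves right.
# Reaching state 4 yields reward +1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def greedy(q_row):
    best = max(q_row)
    return random.choice([a for a, v in enumerate(q_row) if v == best])  # break ties randomly

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # Q-table, initialized to zero
alpha, gamma, epsilon = 0.1, 0.9, 0.1              # learning rate, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit current Q.
        action = random.randrange(N_ACTIONS) if random.random() < epsilon else greedy(Q[state])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print(Q)  # Q-values should end up favoring action 1 (move right) in every state
```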
Model-based vs. model-free: model-based methods learn (or are given) a model of the environment's dynamics and use it to plan ahead; model-free methods learn value functions or policies directly from experience without modeling the environment.
On-policy vs. off-policy: on-policy methods (e.g., SARSA) evaluate and improve the same policy that is collecting the experience; off-policy methods (e.g., Q-learning) learn about a target policy from data generated by a different behavior policy.
Canonical algorithms: Q-learning, SARSA, policy gradient methods, actor-critic, PPO, SAC, Monte Carlo Tree Search, AlphaZero.
Example applications: Game playing (Go, Chess, Atari), robotics (manipulation, locomotion), autonomous driving, resource management, recommendation systems.
Reinforcement learning is notoriously challenging. Sample efficiency is often poor (millions of interactions needed), training is unstable, hyperparameters are sensitive, and debugging is difficult. The spectacular successes (AlphaGo, game-playing AIs) required enormous compute. Real-world deployment remains limited compared to supervised learning.
Let's systematically compare supervised, unsupervised, and reinforcement learning across key dimensions.
| Dimension | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Data | Labeled examples (x, y) | Unlabeled examples (x) | Interactions (s, a, r, s') |
| Feedback | Correct answer provided | No feedback | Reward signal (delayed, sparse) |
| Goal | Predict y from x | Find structure in X | Maximize cumulative reward |
| Evaluation | Compare to ground truth | Subjective/task-dependent | Total reward obtained |
| Prototypical task | Classification, regression | Clustering, dimensionality reduction | Control, game-playing |
| Data source | Static dataset | Static dataset | Interactive experience |
| Algorithms | SVM, trees, neural nets | K-means, PCA, GMM | Q-learning, policy gradient |
Use supervised learning when: you have (or can affordably obtain) labeled examples and the goal is to predict a known kind of output for new inputs.
Use unsupervised learning when: you have unlabeled data and want to discover structure in it: groups, low-dimensional representations, or the underlying distribution.
Use reinforcement learning when: the problem involves sequential decisions, you can interact with an environment (or a faithful simulator), and success can be expressed as a reward signal.
In practice, the vast majority of deployed ML systems use supervised learning. Its problem formulation is clearest, evaluation is most straightforward, and algorithms are most mature. Unsupervised methods often serve as preprocessing for supervised tasks. RL remains the most difficult to deploy successfully.
The supervised/unsupervised/reinforcement taxonomy is foundational but incomplete. Modern ML includes several important variations and hybrids.
Setting: Many unlabeled examples, few labeled examples.
Idea: Use the structure in unlabeled data to improve learning from limited labels. If unlabeled examples show that certain regions of input space are dense or clustered, this helps decide where decision boundaries should go.
Techniques: Self-training, co-training, consistency regularization, graph-based methods.
Why it matters: Labels are expensive; unlabeled data is cheap. Semi-supervised learning leverages abundant unlabeled data to reduce label requirements.
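A sketch of the self-training idea using scikit-learn's SelfTrainingClassifier, assuming a recent scikit-learn; the 50/950 labeled/unlabeled split and the confidence threshold are arbitrary, and in practice you would evaluate on a separate held-out labeled set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

# 1,000 examples, but pretend only 50 are labeled; unlabeled points are marked with -1.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
y_partial = np.full_like(y, -1)
labeled_idx = np.random.RandomState(0).choice(len(y), size=50, replace=False)
y_partial[labeled_idx] = y[labeled_idx]

# Self-training: fit on labeled points, pseudo-label confident unlabeled points, refit, repeat.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

print("accuracy against the held-back labels:", accuracy_score(y, model.predict(X)))
```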
Setting: Unlabeled data, but create pseudo-labels from the data itself.
Examples: predicting masked words in a sentence (as in BERT), predicting the next word in a sequence (as in GPT), and predicting the rotation that was applied to an image.
Why it matters: Self-supervised pretraining on massive unlabeled datasets produces representations that transfer remarkably well to downstream tasks. This is the foundation of modern NLP (GPT, BERT) and increasingly computer vision.
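Self-supervision is less about any particular model than about how labels are manufactured from raw data. A tiny sketch of turning unlabeled text into next-word prediction pairs; the sentence and window size are arbitrary.

```python
# Build (context, next word) training pairs from raw text with no human labeling.
text = "machine learning is about extracting patterns from data"
words = text.split()

context_size = 3
pairs = []
for i in range(len(words) - context_size):
    context = words[i:i + context_size]    # pseudo-input: a window of consecutive words
    target = words[i + context_size]       # pseudo-label: the word that follows the window
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# Any supervised learner (in practice a large neural network) can be trained on these pairs.
```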
Setting: Leverage knowledge from one task/domain to improve learning on another.
Approach: Pretrain a model on a large dataset (source), then fine-tune on a smaller target dataset.
Why it matters: When target data is limited, transfer learning can dramatically improve performance by reusing representations learned from abundant source data. ImageNet pretraining revolutionized computer vision; LLM pretraining revolutionized NLP.
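A minimal fine-tuning sketch, assuming PyTorch and torchvision are available; the 5-class head and the random stand-in batch are placeholders for a real target dataset, and loading the pretrained weights downloads them on first use.

```python
import torch
import torch.nn as nn
from torchvision import models

# Source knowledge: a ResNet-18 pretrained on ImageNet.
model = models.resnet18(weights="DEFAULT")

# Freeze the pretrained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh head for the smaller target task (here, 5 classes).
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are optimized; everything else is reused as-is.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# From here, training proceeds as ordinary supervised learning on the target data.
dummy_images = torch.randn(8, 3, 224, 224)   # stand-in for a batch of target-task images
print(model(dummy_images).shape)             # torch.Size([8, 5])
```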
Setting: Learn multiple related tasks simultaneously.
Idea: Shared representations benefit all tasks. What's learned for one task provides inductive bias for others.
Why it matters: More efficient use of data and compute. Performance gains from positive transfer between tasks.
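A sketch of the shared-representation idea in PyTorch; the network sizes and the pairing of a classification task with a regression task are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """One shared trunk feeding two task-specific heads."""
    def __init__(self, in_dim=32, hidden=64, n_classes=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, n_classes)   # task A: classification
        self.reg_head = nn.Linear(hidden, 1)           # task B: regression

    def forward(self, x):
        h = self.shared(x)                  # representation shared across tasks
        return self.cls_head(h), self.reg_head(h)

model = MultiTaskNet()
x = torch.randn(16, 32)
cls_labels = torch.randint(0, 3, (16,))
reg_targets = torch.randn(16, 1)

logits, values = model(x)
# Combined loss: gradients from both tasks shape the shared representation.
loss = F.cross_entropy(logits, cls_labels) + F.mse_loss(values, reg_targets)
loss.backward()
print("combined loss:", float(loss))
```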
Setting: The algorithm can request labels for specific examples.
Idea: Choose examples that will be most informative to label—typically examples where the model is uncertain. Achieve good performance with fewer labels.
Why it matters: Labels are expensive. Smart selection of what to label can dramatically reduce annotation costs.
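A sketch of uncertainty sampling, the simplest active-learning query strategy; the pool sizes, number of rounds, and query batch size are arbitrary, and the 'oracle' here just reveals labels we held back.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Start with a tiny labeled pool; everything else sits in an unlabeled pool.
rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X), size=20, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for round_number in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # Uncertainty sampling: query the points whose predicted probability is closest to 0.5.
    probs = clf.predict_proba(X[unlabeled])[:, 1]
    most_uncertain = np.argsort(np.abs(probs - 0.5))[:10]
    queries = [unlabeled[i] for i in most_uncertain]

    # Ask the oracle for labels (here, simply reveal the held-back true labels).
    labeled.extend(queries)
    unlabeled = [i for i in unlabeled if i not in queries]
    print(f"round {round_number}: {len(labeled)} labels, accuracy {clf.score(X, y):.3f}")
```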
Setting: Learn from expert demonstrations rather than reward signals.
Relationship to RL: Like RL (sequential decision-making) but with supervised-like feedback (expert actions).
Techniques: Behavioral cloning, inverse reinforcement learning, GAIL.
Why it matters: Often easier than RL because demonstrations provide direct supervision. Used in robotics (teach by demonstration) and game AI.
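A sketch of behavioral cloning, the simplest imitation technique: treat the expert's (state, action) pairs as an ordinary supervised dataset. The one-dimensional 'expert' below is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Expert demonstrations: (state, action) pairs recorded from a hypothetical expert
# whose rule is "move right (action 1) whenever the position is negative".
rng = np.random.RandomState(0)
states = rng.uniform(-1.0, 1.0, size=(500, 1))
expert_actions = (states[:, 0] < 0).astype(int)

# Behavioral cloning: plain supervised classification of the expert's action given the state.
policy = RandomForestClassifier(n_estimators=50, random_state=0)
policy.fit(states, expert_actions)

# The cloned policy maps new states to actions without ever seeing a reward signal.
print(policy.predict([[-0.3], [0.7]]))   # expected: [1 0]
```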
Setting: Data arrives sequentially; the model must learn and predict in an ongoing fashion.
Contrast with batch learning: Traditional ML assumes a fixed dataset. Online learning handles streams and concept drift.
Why it matters: Many real-world applications (recommendations, fraud detection) operate on data streams where batch assumptions fail.
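A sketch of incremental (online) learning with scikit-learn's partial_fit interface, assuming a recent scikit-learn; the simulated stream and batch size are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# A linear classifier updated incrementally as mini-batches arrive from a stream.
clf = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])            # all possible classes must be declared up front

rng = np.random.RandomState(0)
for batch in range(100):
    # Simulated stream: each batch is a small chunk of (features, label) pairs.
    X_batch = rng.randn(32, 10)
    y_batch = (X_batch[:, 0] + 0.1 * rng.randn(32) > 0).astype(int)

    clf.partial_fit(X_batch, y_batch, classes=classes)   # one incremental update per batch

# The model can make predictions at any point in the stream, not just after a full pass.
print(clf.predict(rng.randn(3, 10)))
```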
Modern ML increasingly combines elements from multiple paradigms. A large language model might use self-supervised pretraining (unsupervised-like), supervised fine-tuning, and RLHF (reinforcement learning from human feedback). Real systems are often hybrids that defy simple categorization.
Data is central to all machine learning, but its role differs across paradigms.
In supervised learning, labeled data encodes human knowledge about the task. Each (x, y) pair says 'for this input, the correct output is this.' The quality, quantity, and representativeness of the labels determine what can be learned.
Data challenges: labels are expensive to acquire (often requiring experts), label noise and annotator disagreement degrade quality, classes may be imbalanced, and a mismatch between the training distribution and the deployment distribution hurts performance.
In unsupervised learning, the data is all there is. Structure must emerge from the data's own regularities: clusters, manifolds, distributional patterns. There is no external signal saying which structure matters.
Data challenges: deciding what structure matters and how to measure similarity, coping with noise and outliers, and evaluating results without any ground truth.
In reinforcement learning, data isn't given; it's generated through interaction. The agent's policy determines which states it visits and what data it collects. This creates a circular dependency: learning requires data, but data quality depends on learning.
Data challenges: poor sample efficiency (enormous numbers of interactions may be needed), specifying a reward that captures what you actually want, and the gap between simulators and the real world.
| Aspect | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Data type | Labeled examples | Unlabeled examples | Trajectories (s, a, r, s') |
| Key bottleneck | Label acquisition cost | Defining what structure matters | Sample efficiency |
| Typical quantity | 1K - 1M examples | 10K - 10B examples | 1M - 1B interactions |
| Data quality | Label accuracy critical | Noise tolerance varies | Reward specification critical |
| Distribution shift | Train/test mismatch hurts | Less directly impactful | Sim-to-real gap problematic |
In all paradigms, data is often the practical bottleneck. Algorithmic improvements are easier than data improvements. The best algorithm on poor data usually loses to a decent algorithm on excellent data. Invest in data quality, quantity, and curation.
Let's practice recognizing which paradigm fits various real-world problems.
Medical diagnosis from images: Given X-ray/MRI/CT images labeled by radiologists as 'cancer' or 'no cancer,' train a classifier. This is binary classification—prototypically supervised.
Speech-to-text transcription: Audio recordings paired with accurate transcripts. The model learns to map audio features to text. Supervised learning with structured output.
Predicting customer churn: Historical data on customers who stayed vs. left, with features like usage patterns, tenure, complaints. Binary classification.
Customer segmentation: You have customer data but no predefined segments. Use clustering to discover natural groupings for targeted marketing.
Anomaly detection in network traffic: Normal traffic is abundant; attacks are rare and varied. Learn the distribution of normal traffic; flag outliers.
Topic modeling on documents: Given many documents, discover latent topics without predefined categories. What themes emerge from the data?
Game playing (Go, Chess, Atari): The agent takes actions (moves), observes outcomes (board states), and receives rewards (win/lose/score). Sequential decision-making with delayed rewards.
Robot control: A robot arm must learn to grasp objects. It takes motor commands, observes results, and receives reward for successful grasps. Physical interaction with environment.
Resource allocation in data centers: Allocate compute resources to jobs to maximize throughput. Decisions have sequential consequences; reward is overall efficiency.
Recommendation systems: Often framed as supervised (predict ratings) but can be RL (maximize long-term engagement). The right framing depends on objectives and data.
Autonomous driving: Perception (object detection) is supervised. Planning (what maneuver to make) could be supervised (imitation learning) or RL (optimize driving policy).
Text generation: Modern LLMs use self-supervised pretraining (predict next word), supervised fine-tuning (instruction following), and RLHF—all three paradigms!
Drug discovery: Predicting molecular properties is supervised. Generating new molecules might be unsupervised (generative models) or RL (optimize for desired properties).
Many problems can be formulated in multiple ways. The 'right' paradigm depends on available data, practical constraints, and what you ultimately care about. Choosing the formulation is a critical modeling decision.
We've mapped the terrain of machine learning paradigms. Let's consolidate:
Supervised learning: learn a mapping from inputs to provided labels (classification, regression); the most mature and widely deployed paradigm, bottlenecked by label acquisition.
Unsupervised learning: discover structure in unlabeled data through clustering, dimensionality reduction, and density estimation; evaluation is more open-ended.
Reinforcement learning: learn a policy through interaction with an environment, balancing exploration and exploitation to maximize cumulative reward; powerful but hard to deploy.
Beyond the basic taxonomy: semi-supervised, self-supervised, transfer, multi-task, active, imitation, and online learning extend and combine these paradigms, and modern systems are often hybrids.
Across all paradigms, data quality, quantity, and curation are usually the practical bottleneck.
What's next:
We've covered the fundamental definitions, learning from data concepts, the contrast with traditional programming, and the major learning paradigms. The final page of this module explores the historical evolution of machine learning—the intellectual journey from early ideas to modern practice, and how understanding this history illuminates where the field is headed.
You now have a comprehensive map of machine learning paradigms. When facing a new problem, you can recognize which paradigm applies, understand the data requirements, and anticipate the challenges. This taxonomic clarity is essential for effective ML practice.