When you first heard the term 'machine learning,' you likely formed an intuitive picture: computers getting smarter, systems that improve with experience, or perhaps AI that can recognize faces and understand speech. But as we embark on a serious study of this field—one that will take us through linear algebra, probability theory, optimization, and statistical learning theory—we need something more precise than intuition.
We need a formal definition.
Not because formalism is inherently valuable, but because precision eliminates confusion. When we say a machine 'learns,' what exactly is happening? How do we know learning has occurred? How do we measure it? Without clear answers, we cannot design, analyze, or improve learning systems.
This page establishes the mathematical and conceptual foundations you need to think rigorously about machine learning. We'll move from intuitive understanding to formal definitions, and you'll see that the precision isn't pedantic—it's powerful.
By the end of this page, you will understand Tom Mitchell's formal definition of machine learning, the mathematical framework underlying all learning systems, and why this definition matters for everything from debugging models to designing new algorithms. You'll be able to precisely articulate what it means when someone says their system 'learned.'
Before diving into formalism, let's ground ourselves with examples of what we intuitively recognize as learning:
A child learning to recognize dogs: After seeing many dogs—golden retrievers, poodles, German shepherds—a child can recognize a new breed they've never seen before as a dog. Something has been extracted from the examples that generalizes.
A spam filter improving over time: When users mark certain emails as spam, the filter gets better at catching similar emails in the future. The system's behavior changes based on feedback.
A chess AI that defeats grandmasters: After analyzing millions of games and playing against itself, the system develops strategies no human explicitly programmed. Knowledge emerged from data.
What's common across these examples? Three things: a task being performed, experience to draw on, and measurable improvement in performance.
This intuitive understanding—learning as improvement on a task through experience—is remarkably robust. It captures what we mean when we say something 'learned,' whether that something is a child, an animal, or a computer program.
But intuition has limits. Consider two edge cases we will return to later: a classifier that merely memorizes its training examples, and a language model that reproduces text it has already seen. Has either one learned?
To resolve these ambiguities, we need precision.
In his seminal 1997 textbook Machine Learning, Tom Mitchell provided the definition that has become the standard foundation for the field:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
This definition is deceptively simple. Let's unpack each component with the precision it deserves:
The task is what we want the learning system to accomplish. Critically, the task is not 'learning'—learning is the means by which the system becomes capable of performing the task. Common task types include classification (assigning inputs to discrete categories), regression (predicting continuous values), clustering (grouping similar inputs), and sequence prediction (forecasting what comes next).
The task definition matters enormously. A poorly specified task leads to a model that optimizes for the wrong thing—a problem that plagues real-world ML systems.
The performance measure quantifies how well the system performs the task. This might seem straightforward, but choosing P is one of the most critical decisions in any ML project.
The performance measure must be defined independently of the training process. If we measure performance on the same data used for learning, we cannot distinguish genuine improvement from memorization.
For a spam classifier, P might be overall accuracy, precision and recall on the spam class, or the false-positive rate on legitimate email; each choice rewards different behavior.
The choice of P shapes what the learning algorithm optimizes for, and thus what behavior emerges. This is why 'Goodhart's Law' haunts ML: when a measure becomes a target, it ceases to be a good measure.
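To see how much the choice of P matters, here is a minimal sketch (plain Python; the email counts are hypothetical, not from the text) of a 'lazy' spam classifier that predicts 'not spam' for every message on an imbalanced test set:

```python
# Hypothetical predictions on an imbalanced test set:
# 95 legitimate emails (label 0) and 5 spam emails (label 1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100          # a lazy classifier that predicts "not spam" for everything

accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

true_pos  = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
recall    = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0

print(f"accuracy  = {accuracy:.2f}")   # 0.95 -- looks great
print(f"precision = {precision:.2f}")  # 0.00 -- catches no spam at all
print(f"recall    = {recall:.2f}")     # 0.00
```

The same predictions look excellent under one P and useless under another, which is exactly the Goodhart-style failure described above.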
Experience is the data or interactions from which the system learns. The nature of experience distinguishes fundamentally different learning paradigms: supervised learning (labeled input-output examples), unsupervised learning (unlabeled data whose structure must be discovered), and reinforcement learning (interaction with an environment that provides rewards).
The experience also has mathematical properties that affect learning: how many examples are available, whether they are drawn independently from a fixed distribution, and how much noise corrupts the labels.
Mitchell's definition makes clear that 'learning' is not mystical. It's a precisely measurable phenomenon: performance P on task T improves with experience E. If we cannot measure improvement, we cannot claim learning occurred. If performance doesn't improve, the system hasn't learned—no matter how sophisticated the algorithm.
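This measurability can be checked directly. The sketch below (Python; it assumes NumPy and scikit-learn are available, neither of which the text prescribes, and uses a synthetic dataset) fixes a task T and a measure P, then grows the experience E and records whether P improves on held-out data:

```python
# Hypothetical check of "learning" in Mitchell's sense: fix the task T (binary
# classification) and the measure P (accuracy on held-out data), then grow the
# experience E and see whether P improves.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for n in [20, 100, 500, 2500]:                      # increasing amounts of experience E
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = model.score(X_test, y_test)               # performance P on unseen data
    print(f"n = {n:5d}  held-out accuracy = {acc:.3f}")
```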
To work with machine learning rigorously—to prove theorems, derive algorithms, and analyze behavior—we need mathematical notation. Here's the standard framework that underpins the field:
Let X denote the input space—the set of all possible inputs. For image classification, X might be all possible images of a certain resolution. For spam detection, X might be all possible email texts.
Let Y denote the output space or label space. For binary classification, Y = {0, 1} or {-1, +1}. For multi-class classification with k classes, Y = {1, 2, ..., k}. For regression, Y = ℝ (the real numbers).
We assume there exists an unknown target function (or target conditional distribution):
f: X → Y
This is what we're trying to learn. In classification, f(x) gives the true class of input x. In regression, f(x) gives the true output value. We never observe f directly—we only see examples of input-output pairs that f generated (possibly with noise).
Our experience E typically comes as a training set:
D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
where each (xᵢ, yᵢ) is an example with input xᵢ ∈ X and label yᵢ ∈ Y.
We usually assume examples are drawn i.i.d. (independently and identically distributed) from some fixed but unknown probability distribution P(X, Y). This assumption is fundamental—it says the training data is representative of what we'll encounter at test time.
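The framework is easy to instantiate in code. The following sketch (Python with NumPy; the target function and noise level are hypothetical choices for illustration) builds a training set D of i.i.d. examples generated by an underlying f that the learner never observes directly:

```python
# A hypothetical instance of the framework: a true target f (never observed
# directly by the learner), an input distribution, and a training set D of
# i.i.d. noisy examples.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """The unknown target function -- known here only because the data is synthetic."""
    return np.sin(2 * np.pi * x)

n = 30
x = rng.uniform(0.0, 1.0, size=n)          # inputs drawn i.i.d. from P(X)
y = f(x) + rng.normal(0.0, 0.1, size=n)    # labels: f(x) corrupted by noise
D = list(zip(x, y))                        # the training set D = {(x_i, y_i)}
```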
A learning algorithm doesn't search the space of all possible functions (that would be infinite and intractable). Instead, it searches within a hypothesis space H—a predefined set of candidate functions:
H = {h : X → Y}
For linear classifiers, H is the set of all linear functions. For decision trees of depth d, H is all such trees. For neural networks of a given architecture, H is all weight configurations.
The choice of H embodies our inductive bias—assumptions about what kinds of functions are likely to be good. If the true target f lies outside H, we cannot learn it exactly, no matter how much data we have.
A learning algorithm A is a function that takes training data D and produces a hypothesis h ∈ H:
A: D → h ∈ H
Given training data, the algorithm selects the 'best' hypothesis according to some criterion (usually minimizing training error or a related objective).
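As an illustration of A: D → h ∈ H (the hypothesis space here is a deliberately tiny, hypothetical one, not something the text prescribes), the sketch below searches the set of one-dimensional threshold classifiers and returns the one with the lowest training error:

```python
# A minimal learning algorithm A: D -> h in H, where H is the set of 1-D
# threshold classifiers h_t(x) = 1 if x >= t else 0. A picks the threshold
# with the lowest training error. Purely illustrative.
def learn_threshold(D):
    """D is a list of (x, y) pairs with x a float and y in {0, 1}."""
    candidates = sorted(x for x, _ in D)                 # candidate thresholds
    def error(t):
        return sum(int((1 if x >= t else 0) != y) for x, y in D)
    best_t = min(candidates, key=error)
    return lambda x: 1 if x >= best_t else 0             # the chosen hypothesis h

D = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]             # hypothetical training set
h = learn_threshold(D)
print(h(0.2), h(0.8))                                    # -> 0 1
```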
To quantify 'best,' we define a loss function (or cost function):
L: Y × Y → ℝ⁺
L(y, ŷ) measures how bad it is to predict ŷ when the true value is y.
Common loss functions include the 0-1 loss for classification (1 if ŷ ≠ y, 0 otherwise), the squared loss (y − ŷ)² for regression, and the cross-entropy loss for probabilistic predictions.
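A minimal sketch of these losses in Python (the language is this example's choice, not the text's), each with the signature L(y, ŷ):

```python
# Sketches of the loss functions named above, each taking the true value y
# and the prediction y_hat.
import math

def zero_one_loss(y, y_hat):
    """0-1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 0.0 if y == y_hat else 1.0

def squared_loss(y, y_hat):
    """Squared loss for regression."""
    return (y - y_hat) ** 2

def cross_entropy_loss(y, p_hat, eps=1e-12):
    """Binary cross-entropy: y in {0, 1}, p_hat is the predicted probability of class 1."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))
```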
The true risk (or expected loss, or generalization error) of a hypothesis h is:
R(h) = 𝔼_{(x,y)~P}[L(y, h(x))]
This is the expected loss over all possible examples drawn from the true distribution P. This is what we actually care about—performance on future, unseen data.
The problem: we don't know P, so we can't compute R(h) directly.
Instead, we compute the empirical risk (or training error):
R̂(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
This is the average loss on the training data. Most learning algorithms work by minimizing empirical risk—a principle called Empirical Risk Minimization (ERM).
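Because ERM is so central, it helps to see that R̂(h) is only a few lines of code. The sketch below (Python, with D a list of (x, y) pairs as in the earlier hypothetical examples) computes the average loss directly from the definition:

```python
# Empirical risk R_hat(h) = (1/n) * sum_i L(y_i, h(x_i)), computed directly
# from the definition above. D is a list of (x, y) pairs.
def empirical_risk(h, D, loss):
    return sum(loss(y, h(x)) for x, y in D) / len(D)

# With the hypothetical threshold classifier and 0-1 loss from the earlier
# sketches: empirical_risk(h, D, zero_one_loss) -> 0.0 on its own training set.
```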
The fundamental question of machine learning: When does minimizing R̂(h) lead to low R(h)? When does performance on training data predict performance on new data? This is the generalization question, and answering it rigorously is what statistical learning theory is all about.
A model can have near-zero empirical risk (training error) while having high true risk (test error). This gap, and the failure mode it signals, is called overfitting: the model memorized the training data rather than learning the underlying pattern. It is the central challenge of machine learning. Everything in ML, from regularization to cross-validation to architecture design, is in some sense about managing this gap.
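To see the gap concretely, here is a hedged illustration (not part of the original text): it fits polynomials of two different degrees to noisy samples of a sine target and compares training error with error on a large fresh sample, which stands in for the true risk. NumPy's polyfit is used purely for convenience.

```python
# A hypothetical illustration of the gap between empirical risk and true risk:
# fit polynomials of two degrees to noisy sine data and compare error on the
# training set with error on a large fresh sample (a stand-in for R(h)).
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)   # the same hypothetical target as the earlier sketch

x_train = rng.uniform(0, 1, 15)
y_train = f(x_train) + rng.normal(0, 0.1, 15)
x_test = rng.uniform(0, 1, 10_000)                 # large sample approximates R(h)
y_test = f(x_test) + rng.normal(0, 0.1, 10_000)

for degree in (3, 14):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit within H_degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The high-degree fit drives training error toward zero while error on fresh data grows: low R̂(h), high R(h).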
You might wonder: do practitioners actually think in these formal terms? Isn't this just academic abstraction?
The answer is that every practical decision in ML maps to components of this framework. The formal structure isn't overhead—it's the skeleton that everything hangs on.
When a model fails, systematic debugging traces back to the formal components:
| Symptom | Possible Cause | Formal Component |
|---|---|---|
| High training error, high test error | Model too simple (underfitting) | Hypothesis space H too restrictive |
| Low training error, high test error | Memorization (overfitting) | Insufficient experience E or H too expressive |
| Performance plateau despite more data | Wrong task formulation or performance measure | T or P misspecified |
| Model performs well on benchmarks but fails in production | Distribution mismatch | Experience E not representative |
| User complaints despite 'good' metrics | Wrong performance measure | P doesn't capture what users actually want |
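As a sketch only, the table can be folded into a first-pass diagnostic helper; the error thresholds below are arbitrary placeholders, not recommendations, and a real project would set them per task and metric:

```python
# A first-pass diagnostic following the table above. The thresholds are
# arbitrary placeholders; real projects tune them per task and metric.
def diagnose(train_error, test_error, high=0.15, gap=0.05):
    if train_error > high and test_error > high:
        return "Underfitting: consider a more expressive hypothesis space H."
    if test_error - train_error > gap:
        return "Overfitting: gather more experience E or constrain H."
    return "Errors are comparable; revisit the task T and measure P if users still complain."

print(diagnose(train_error=0.20, test_error=0.22))
print(diagnose(train_error=0.02, test_error=0.15))
```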
Every modeling choice is a choice about T, P, E, or H:
Choosing between logistic regression and neural networks? You're choosing a hypothesis space H.
Collecting more training data? You're increasing experience E.
Using cross-validation instead of a single held-out test set? You're refining how you estimate performance on unseen data.
Switching from accuracy to F1 score? You're changing the performance measure P.
Realizing your model should predict probabilities, not just labels? You're redefining the task T.
The formal framework provides a shared vocabulary for teams to discuss these decisions precisely. 'We're overfitting' is vague. 'Our empirical risk is 0.02 but true risk estimate is 0.15, suggesting H is too expressive for the given E' is actionable.
At top ML teams, model development explicitly tracks these components. Documentation includes the exact task formulation, the evaluation metrics (with justifications), the data sources and their properties, and the model family. This isn't bureaucracy—it's engineering discipline that prevents expensive mistakes.
We can now state the machine learning problem with complete precision:
Given: a training set D = {(x₁, y₁), ..., (xₙ, yₙ)} drawn i.i.d. from an unknown distribution P(X, Y), a hypothesis space H, and a loss function L.
Find: a hypothesis h ∈ H whose true risk R(h) is as small as possible.
Constraint: we can never compute R(h) directly; we can only observe the empirical risk R̂(h) on the finite sample D.
This is the core problem. Everything else—specific algorithms, architectures, regularization techniques, evaluation protocols—is about solving this problem in various contexts with various constraints.
The learning problem embeds a fundamental tradeoff that pervades all of machine learning:
Expressiveness vs. Generalization
A hypothesis space that is too simple cannot capture complex patterns (underfitting); one that is too complex will memorize noise rather than learn the signal (overfitting). This tension is related to the bias-variance tradeoff, which we'll examine in depth later. For now, understand that the art of machine learning lies in choosing a hypothesis space that is 'just right'—expressive enough to capture the true function, but constrained enough to generalize from finite data—and that the right choice depends on the amount of data, the true underlying complexity, and the cost of different types of errors.
To sharpen our understanding, let's be clear about what does not constitute learning under this definition. A program whose behavior is fixed by hand-coded rules has not learned, no matter how intelligent it appears, and a system that merely accumulates data has not learned unless its performance P on task T actually improves as a result.
Some cases are genuinely ambiguous:
Nearest-neighbor classifiers memorize training data but use similarity to generalize. Is this learning? By Mitchell's definition, yes—if performance on the task (classification of new examples) improves with more experience (training examples). The mechanism (similarity-based generalization) differs from parametric learning, but the outcome meets the definition.
Large language models are trained on massive text corpora and sometimes appear to 'memorize' specific texts. Are they learning or memorizing? The answer depends on whether performance generalizes to genuinely new inputs. If a model only reproduces exact training text, it hasn't learned language—it has implemented search. If it produces coherent text on novel topics, learning has occurred.
The definition provides a principled way to resolve such debates: measure performance on the task using data not seen during training. Improvement implies learning; no improvement implies memorization or failure.
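To make the nearest-neighbor case concrete, here is a small sketch (Python with NumPy; the target rule and sample sizes are hypothetical) that applies exactly this test: store training examples, predict by copying the nearest stored label, and measure accuracy on points never seen during training as the stored experience grows.

```python
# A hypothetical check of whether a memorizing 1-nearest-neighbor classifier
# "learns" in Mitchell's sense: its performance on unseen points improves as
# the stored experience grows.
import numpy as np

rng = np.random.default_rng(0)

def true_label(X):
    # Hypothetical ground-truth rule: class 1 above the line x0 + x1 = 1.
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_test = rng.uniform(0, 1, size=(2000, 2))
y_test = true_label(X_test)

def predict_1nn(X_train, y_train, X_query):
    # For each query point, copy the label of the closest stored example.
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(dists, axis=1)]

for n in (5, 50, 500):
    X_train = rng.uniform(0, 1, size=(n, 2))
    y_train = true_label(X_train)
    acc = np.mean(predict_1nn(X_train, y_train, X_test) == y_test)
    print(f"stored examples = {n:4d}  accuracy on unseen points = {acc:.3f}")
```

Accuracy on unseen points rises with more stored examples, so by the definition the memorizing classifier has learned.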
The single most important word in machine learning is 'generalization.' A system hasn't truly learned unless it performs well on data it has never seen. Training performance is necessary but not sufficient. Test performance is the true measure.
Mitchell's 1997 definition didn't emerge in a vacuum. It crystallized decades of work in computer science, statistics, and cognitive science:
1950s: The Turing Test and Conceptual Beginnings
Alan Turing's 1950 paper 'Computing Machinery and Intelligence' posed the question of whether machines can think—and proposed learning as a path to intelligence. Turing suggested programming a 'child machine' and then educating it, rather than trying to program adult-level intelligence directly. This foreshadowed the core ML insight: let the machine acquire knowledge from data rather than explicit programming.
1950s-60s: Perceptrons and Early Neural Networks
Frank Rosenblatt's perceptron (1957) was one of the first learning algorithms: it adjusted its weights based on errors, improving classification performance over time. Minsky and Papert's 1969 critique showed the perceptron's limitations but also demonstrated that formal analysis of learning was possible and necessary.
1960s-80s: Statistical Learning Theory Emerges
Vladimir Vapnik and Alexey Chervonenkis developed the mathematical foundations of learning theory, introducing concepts like VC dimension and structural risk minimization. This work provided the first rigorous answers to 'when can we generalize?'—replacing intuition with provable bounds.
1980s: PAC Learning and Computational Perspectives
Leslie Valiant's 1984 paper introduced Probably Approximately Correct (PAC) learning, combining computational complexity with learning theory. For the first time, we could formally ask: 'How much data and time are needed to learn a given concept class?' This computational perspective remains central to modern ML theory.
1990s: Consolidation and Mitchell's Definition
By the 1990s, multiple research communities—AI, statistics, neural networks, computational learning theory—were converging on similar insights. Mitchell's 1997 textbook synthesized this work, providing the T-P-E definition that could encompass all these approaches under one framework.
2000s-Present: Deep Learning and Scale
The deep learning revolution hasn't changed the fundamental definition—it has changed what's achievable within it. Massive hypothesis spaces (billion-parameter neural networks) combined with massive experience (internet-scale data) have shifted what tasks can be solved. But the basic framework—learn from data, measure performance on unseen examples—remains.
Understanding this history isn't mere trivia. The concepts we'll study—VC dimension, PAC bounds, bias-variance tradeoff—came from specific intellectual struggles. Knowing why these ideas were needed helps you understand when to apply them and why they matter.
We've established the formal foundations that will underpin everything in this curriculum. Let's consolidate: Mitchell's definition makes learning measurable (performance P on task T improves with experience E); the formal framework (input space, output space, target function, training set, hypothesis space, loss, and risk) gives us a precise vocabulary; and the gap between empirical risk and true risk frames the generalization question at the heart of the field.
What's next:
With the formal definition established, we turn to the more intuitive but equally important question: what does it mean to 'learn from data'? The next page explores this concept in depth—examining how examples become generalizable knowledge, why this process works at all, and what assumptions make it possible.
You now possess the formal vocabulary and conceptual framework for machine learning. When someone says 'the model learned,' you can ask precise questions: What was the task? How was performance measured? What data did it learn from? This precision transforms vague discussions into actionable analysis.