When you first heard the term 'machine learning,' you likely formed an intuitive picture: computers getting smarter, systems that improve with experience, or perhaps AI that can recognize faces and understand speech. But as we embark on a serious study of this field—one that will take us through linear algebra, probability theory, optimization, and statistical learning theory—we need something more precise than intuition.
We need a formal definition.
Not because formalism is inherently valuable, but because precision eliminates confusion. When we say a machine 'learns,' what exactly is happening? How do we know learning has occurred? How do we measure it? Without clear answers, we cannot design, analyze, or improve learning systems.
This page establishes the mathematical and conceptual foundations you need to think rigorously about machine learning. We'll move from intuitive understanding to formal definitions, and you'll see that the precision isn't pedantic—it's powerful.
By the end of this page, you will understand Tom Mitchell's formal definition of machine learning, the mathematical framework underlying all learning systems, and why this definition matters for everything from debugging models to designing new algorithms. You'll be able to precisely articulate what it means when someone says their system 'learned.'
Before diving into formalism, let's ground ourselves with examples of what we intuitively recognize as learning:
A child learning to recognize dogs: After seeing many dogs—golden retrievers, poodles, German shepherds—a child can recognize a new breed they've never seen before as a dog. Something has been extracted from the examples that generalizes.
A spam filter improving over time: When users mark certain emails as spam, the filter gets better at catching similar emails in the future. The system's behavior changes based on feedback.
A chess AI that defeats grandmasters: After analyzing millions of games and playing against itself, the system develops strategies no human explicitly programmed. Knowledge emerged from data.
What's common across these examples? Three things: a task being performed, experience to draw on, and measurable improvement in performance.
This intuitive understanding—learning as improvement on a task through experience—is remarkably robust. It captures what we mean when we say something 'learned,' whether that something is a child, an animal, or a computer program.
But intuition has limits. Consider two edge cases we will return to later: a classifier that merely memorizes its training examples, and a language model that reproduces text it has already seen. Has either one learned?
To resolve these ambiguities, we need precision.
In his seminal 1997 textbook Machine Learning, Tom Mitchell provided the definition that has become the standard foundation for the field:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
This definition is deceptively simple. Let's unpack each component with the precision it deserves:
The task is what we want the learning system to accomplish. Critically, the task is not 'learning'—learning is the means by which the system becomes capable of performing the task. Common task types include classification (assigning inputs to discrete categories), regression (predicting continuous values), clustering (grouping similar inputs), and sequence prediction (forecasting what comes next).
The task definition matters enormously. A poorly specified task leads to a model that optimizes for the wrong thing—a problem that plagues real-world ML systems.
The performance measure quantifies how well the system performs the task. This might seem straightforward, but choosing P is one of the most critical decisions in any ML project.
The performance measure must be defined independently of the training process. If we measure performance on the same data used for learning, we cannot distinguish genuine improvement from memorization.
For a spam classifier, P might be overall accuracy, precision and recall on the spam class, or the false-positive rate on legitimate email; each choice rewards different behavior.
The choice of P shapes what the learning algorithm optimizes for, and thus what behavior emerges. This is why 'Goodhart's Law' haunts ML: when a measure becomes a target, it ceases to be a good measure.
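To see how much the choice of P matters, here is a minimal sketch (plain Python; the email counts are hypothetical, not from the text) of a 'lazy' spam classifier that predicts 'not spam' for every message on an imbalanced test set:

```python
# Hypothetical predictions on an imbalanced test set:
# 95 legitimate emails (label 0) and 5 spam emails (label 1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100          # a lazy classifier that predicts "not spam" for everything

accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

true_pos  = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
recall    = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0

print(f"accuracy  = {accuracy:.2f}")   # 0.95 -- looks great
print(f"precision = {precision:.2f}")  # 0.00 -- catches no spam at all
print(f"recall    = {recall:.2f}")     # 0.00
```

The same predictions look excellent under one P and useless under another, which is exactly the Goodhart-style failure described above.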
Experience is the data or interactions from which the system learns. The nature of experience distinguishes fundamentally different learning paradigms: supervised learning (labeled input-output examples), unsupervised learning (unlabeled data whose structure must be discovered), and reinforcement learning (interaction with an environment that provides rewards).
The experience also has mathematical properties that affect learning: how many examples are available, whether they are drawn independently from a fixed distribution, and how much noise corrupts the labels.
Mitchell's definition makes clear that 'learning' is not mystical. It's a precisely measurable phenomenon: performance P on task T improves with experience E. If we cannot measure improvement, we cannot claim learning occurred. If performance doesn't improve, the system hasn't learned—no matter how sophisticated the algorithm.
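This measurability can be checked directly. The sketch below (Python; it assumes NumPy and scikit-learn are available, neither of which the text prescribes, and uses a synthetic dataset) fixes a task T and a measure P, then grows the experience E and records whether P improves on held-out data:

```python
# Hypothetical check of "learning" in Mitchell's sense: fix the task T (binary
# classification) and the measure P (accuracy on held-out data), then grow the
# experience E and see whether P improves.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for n in [20, 100, 500, 2500]:                      # increasing amounts of experience E
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = model.score(X_test, y_test)               # performance P on unseen data
    print(f"n = {n:5d}  held-out accuracy = {acc:.3f}")
```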
To work with machine learning rigorously—to prove theorems, derive algorithms, and analyze behavior—we need mathematical notation. Here's the standard framework that underpins the field:
Let X denote the input space—the set of all possible inputs. For image classification, X might be all possible images of a certain resolution. For spam detection, X might be all possible email texts.
Let Y denote the output space or label space. For binary classification, Y = {0, 1} or {-1, +1}. For multi-class classification with k classes, Y = {1, 2, ..., k}. For regression, Y = ℝ (the real numbers).
We assume there exists an unknown target function (or target conditional distribution):
f: X → Y
This is what we're trying to learn. In classification, f(x) gives the true class of input x. In regression, f(x) gives the true output value. We never observe f directly—we only see examples of input-output pairs that f generated (possibly with noise).
Our experience E typically comes as a training set:
D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
where each (xᵢ, yᵢ) is an example with input xᵢ ∈ X and label yᵢ ∈ Y.
We usually assume examples are drawn i.i.d. (independently and identically distributed) from some fixed but unknown probability distribution P(X, Y). This assumption is fundamental—it says the training data is representative of what we'll encounter at test time.
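The framework is easy to instantiate in code. The following sketch (Python with NumPy; the target function and noise level are hypothetical choices for illustration) builds a training set D of i.i.d. examples generated by an underlying f that the learner never observes directly:

```python
# A hypothetical instance of the framework: a true target f (never observed
# directly by the learner), an input distribution, and a training set D of
# i.i.d. noisy examples.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """The unknown target function -- known here only because the data is synthetic."""
    return np.sin(2 * np.pi * x)

n = 30
x = rng.uniform(0.0, 1.0, size=n)          # inputs drawn i.i.d. from P(X)
y = f(x) + rng.normal(0.0, 0.1, size=n)    # labels: f(x) corrupted by noise
D = list(zip(x, y))                        # the training set D = {(x_i, y_i)}
```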
A learning algorithm doesn't search the space of all possible functions (that would be infinite and intractable). Instead, it searches within a hypothesis space H—a predefined set of candidate functions:
H = {h : X → Y}
For linear classifiers, H is the set of all linear functions. For decision trees of depth d, H is all such trees. For neural networks of a given architecture, H is all weight configurations.
The choice of H embodies our inductive bias—assumptions about what kinds of functions are likely to be good. If the true target f lies outside H, we cannot learn it exactly, no matter how much data we have.
A learning algorithm A is a function that takes training data D and produces a hypothesis h ∈ H:
A: D → h ∈ H
Given training data, the algorithm selects the 'best' hypothesis according to some criterion (usually minimizing training error or a related objective).
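As an illustration of A: D → h ∈ H (the hypothesis space here is a deliberately tiny, hypothetical one, not something the text prescribes), the sketch below searches the set of one-dimensional threshold classifiers and returns the one with the lowest training error:

```python
# A minimal learning algorithm A: D -> h in H, where H is the set of 1-D
# threshold classifiers h_t(x) = 1 if x >= t else 0. A picks the threshold
# with the lowest training error. Purely illustrative.
def learn_threshold(D):
    """D is a list of (x, y) pairs with x a float and y in {0, 1}."""
    candidates = sorted(x for x, _ in D)                 # candidate thresholds
    def error(t):
        return sum(int((1 if x >= t else 0) != y) for x, y in D)
    best_t = min(candidates, key=error)
    return lambda x: 1 if x >= best_t else 0             # the chosen hypothesis h

D = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]             # hypothetical training set
h = learn_threshold(D)
print(h(0.2), h(0.8))                                    # -> 0 1
```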
To quantify 'best,' we define a loss function (or cost function):
L: Y × Y → ℝ⁺
L(y, ŷ) measures how bad it is to predict ŷ when the true value is y.
Common loss functions include the 0-1 loss for classification (1 if ŷ ≠ y, 0 otherwise), the squared loss (y − ŷ)² for regression, and the cross-entropy loss for probabilistic predictions.
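A minimal sketch of these losses in Python (the language is this example's choice, not the text's), each with the signature L(y, ŷ):

```python
# Sketches of the loss functions named above, each taking the true value y
# and the prediction y_hat.
import math

def zero_one_loss(y, y_hat):
    """0-1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 0.0 if y == y_hat else 1.0

def squared_loss(y, y_hat):
    """Squared loss for regression."""
    return (y - y_hat) ** 2

def cross_entropy_loss(y, p_hat, eps=1e-12):
    """Binary cross-entropy: y in {0, 1}, p_hat is the predicted probability of class 1."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))
```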
The true risk (or expected loss, or generalization error) of a hypothesis h is:
R(h) = 𝔼_{(x,y)~P}[L(y, h(x))]
This is the expected loss over all possible examples drawn from the true distribution P. This is what we actually care about—performance on future, unseen data.
The problem: we don't know P, so we can't compute R(h) directly.
Instead, we compute the empirical risk (or training error):
R̂(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
This is the average loss on the training data. Most learning algorithms work by minimizing empirical risk—a principle called Empirical Risk Minimization (ERM).
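Because ERM is so central, it helps to see that R̂(h) is only a few lines of code. The sketch below (Python, with D a list of (x, y) pairs as in the earlier hypothetical examples) computes the average loss directly from the definition:

```python
# Empirical risk R_hat(h) = (1/n) * sum_i L(y_i, h(x_i)), computed directly
# from the definition above. D is a list of (x, y) pairs.
def empirical_risk(h, D, loss):
    return sum(loss(y, h(x)) for x, y in D) / len(D)

# With the hypothetical threshold classifier and 0-1 loss from the earlier
# sketches: empirical_risk(h, D, zero_one_loss) -> 0.0 on its own training set.
```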
The fundamental question of machine learning: When does minimizing R̂(h) lead to low R(h)? When does performance on training data predict performance on new data? This is the generalization question, and answering it rigorously is what statistical learning theory is all about.
A model can have near-zero empirical risk (training error) while having high true risk (test error). This gap, and the failure mode it signals, is called overfitting: the model memorized the training data rather than learning the underlying pattern. It is the central challenge of machine learning. Everything in ML, from regularization to cross-validation to architecture design, is in some sense about managing this gap.
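To see the gap concretely, here is a hedged illustration (not part of the original text): it fits polynomials of two different degrees to noisy samples of a sine target and compares training error with error on a large fresh sample, which stands in for the true risk. NumPy's polyfit is used purely for convenience.

```python
# A hypothetical illustration of the gap between empirical risk and true risk:
# fit polynomials of two degrees to noisy sine data and compare error on the
# training set with error on a large fresh sample (a stand-in for R(h)).
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)   # the same hypothetical target as the earlier sketch

x_train = rng.uniform(0, 1, 15)
y_train = f(x_train) + rng.normal(0, 0.1, 15)
x_test = rng.uniform(0, 1, 10_000)                 # large sample approximates R(h)
y_test = f(x_test) + rng.normal(0, 0.1, 10_000)

for degree in (3, 14):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit within H_degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The high-degree fit drives training error toward zero while error on fresh data grows: low R̂(h), high R(h).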
You might wonder: do practitioners actually think in these formal terms? Isn't this just academic abstraction?
The answer is that every practical decision in ML maps to components of this framework. The formal structure isn't overhead—it's the skeleton that everything hangs on.
When a model fails, systematic debugging traces back to the formal components:
| Symptom | Possible Cause | Formal Component |
|---|---|---|
| High training error, high test error | Model too simple (underfitting) | Hypothesis space H too restrictive |
| Low training error, high test error | Memorization (overfitting) | Insufficient experience E or H too expressive |
| Performance plateau despite more data | Wrong task formulation or performance measure | T or P misspecified |
| Model performs well on benchmarks but fails in production | Distribution mismatch | Experience E not representative |
| User complaints despite 'good' metrics | Wrong performance measure | P doesn't capture what users actually want |
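As a sketch only, the table can be folded into a first-pass diagnostic helper; the error thresholds below are arbitrary placeholders, not recommendations, and a real project would set them per task and metric:

```python
# A first-pass diagnostic following the table above. The thresholds are
# arbitrary placeholders; real projects tune them per task and metric.
def diagnose(train_error, test_error, high=0.15, gap=0.05):
    if train_error > high and test_error > high:
        return "Underfitting: consider a more expressive hypothesis space H."
    if test_error - train_error > gap:
        return "Overfitting: gather more experience E or constrain H."
    return "Errors are comparable; revisit the task T and measure P if users still complain."

print(diagnose(train_error=0.20, test_error=0.22))
print(diagnose(train_error=0.02, test_error=0.15))
```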
Every modeling choice is a choice about T, P, E, or H:
Choosing between logistic regression and neural networks? You're choosing a hypothesis space H.
Collecting more training data? You're increasing experience E.
Using cross-validation instead of a single held-out test set? You're refining how you estimate performance on unseen data.
Switching from accuracy to F1 score? You're changing the performance measure P.
Realizing your model should predict probabilities, not just labels? You're redefining the task T.
The formal framework provides a shared vocabulary for teams to discuss these decisions precisely. 'We're overfitting' is vague. 'Our empirical risk is 0.02 but true risk estimate is 0.15, suggesting H is too expressive for the given E' is actionable.
At top ML teams, model development explicitly tracks these components. Documentation includes the exact task formulation, the evaluation metrics (with justifications), the data sources and their properties, and the model family. This isn't bureaucracy—it's engineering discipline that prevents expensive mistakes.
We can now state the machine learning problem with complete precision:
Given: a training set D = {(x₁, y₁), ..., (xₙ, yₙ)} drawn i.i.d. from an unknown distribution P(X, Y), a hypothesis space H, and a loss function L.
Find: a hypothesis h ∈ H whose true risk R(h) is as small as possible.
Constraint: we can never compute R(h) directly; we can only observe the empirical risk R̂(h) on the finite sample D.
This is the core problem. Everything else—specific algorithms, architectures, regularization techniques, evaluation protocols—is about solving this problem in various contexts with various constraints.
The learning problem embeds a fundamental tradeoff that pervades all of machine learning:
Expressiveness vs. Generalization
A hypothesis space that is too simple cannot capture complex patterns (underfitting); one that is too complex will memorize noise rather than learn the signal (overfitting). This tension is related to the bias-variance tradeoff, which we'll examine in depth later. For now, understand that the art of machine learning lies in choosing a hypothesis space that is 'just right'—expressive enough to capture the true function, but constrained enough to generalize from finite data—and that the right choice depends on the amount of data, the true underlying complexity, and the cost of different types of errors.
To sharpen our understanding, let's be clear about what does not constitute learning under this definition. A program whose behavior is fixed by hand-coded rules has not learned, no matter how intelligent it appears, and a system that merely accumulates data has not learned unless its performance P on task T actually improves as a result.
Some cases are genuinely ambiguous:
Nearest-neighbor classifiers memorize training data but use similarity to generalize. Is this learning? By Mitchell's definition, yes—if performance on the task (classification of new examples) improves with more experience (training examples). The mechanism (similarity-based generalization) differs from parametric learning, but the outcome meets the definition.
Large language models are trained on massive text corpora and sometimes appear to 'memorize' specific texts. Are they learning or memorizing? The answer depends on whether performance generalizes to genuinely new inputs. If a model only reproduces exact training text, it hasn't learned language—it has implemented search. If it produces coherent text on novel topics, learning has occurred.
The definition provides a principled way to resolve such debates: measure performance on the task using data not seen during training. Improvement implies learning; no improvement implies memorization or failure.
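To make the nearest-neighbor case concrete, here is a small sketch (Python with NumPy; the target rule and sample sizes are hypothetical) that applies exactly this test: store training examples, predict by copying the nearest stored label, and measure accuracy on points never seen during training as the stored experience grows.

```python
# A hypothetical check of whether a memorizing 1-nearest-neighbor classifier
# "learns" in Mitchell's sense: its performance on unseen points improves as
# the stored experience grows.
import numpy as np

rng = np.random.default_rng(0)

def true_label(X):
    # Hypothetical ground-truth rule: class 1 above the line x0 + x1 = 1.
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_test = rng.uniform(0, 1, size=(2000, 2))
y_test = true_label(X_test)

def predict_1nn(X_train, y_train, X_query):
    # For each query point, copy the label of the closest stored example.
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(dists, axis=1)]

for n in (5, 50, 500):
    X_train = rng.uniform(0, 1, size=(n, 2))
    y_train = true_label(X_train)
    acc = np.mean(predict_1nn(X_train, y_train, X_test) == y_test)
    print(f"stored examples = {n:4d}  accuracy on unseen points = {acc:.3f}")
```

Accuracy on unseen points rises with more stored examples, so by the definition the memorizing classifier has learned.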
The single most important word in machine learning is 'generalization.' A system hasn't truly learned unless it performs well on data it has never seen. Training performance is necessary but not sufficient. Test performance is the true measure.
Mitchell's 1997 definition didn't emerge in a vacuum. It crystallized decades of work in computer science, statistics, and cognitive science:
1950s: The Turing Test and Conceptual Beginnings
Alan Turing's 1950 paper 'Computing Machinery and Intelligence' posed the question of whether machines can think—and proposed learning as a path to intelligence. Turing suggested programming a 'child machine' and then educating it, rather than trying to program adult-level intelligence directly. This foreshadowed the core ML insight: let the machine acquire knowledge from data rather than explicit programming.
1950s-60s: Perceptrons and Early Neural Networks
Frank Rosenblatt's perceptron (1957) was one of the first learning algorithms: it adjusted its weights based on errors, improving classification performance over time. Minsky and Papert's 1969 critique showed the perceptron's limitations but also demonstrated that formal analysis of learning was possible and necessary.
1960s-80s: Statistical Learning Theory Emerges
Vladimir Vapnik and Alexey Chervonenkis developed the mathematical foundations of learning theory, introducing concepts like VC dimension and structural risk minimization. This work provided the first rigorous answers to 'when can we generalize?'—replacing intuition with provable bounds.
1980s: PAC Learning and Computational Perspectives
Leslie Valiant's 1984 paper introduced Probably Approximately Correct (PAC) learning, combining computational complexity with learning theory. For the first time, we could formally ask: 'How much data and time are needed to learn a given concept class?' This computational perspective remains central to modern ML theory.
1990s: Consolidation and Mitchell's Definition
By the 1990s, multiple research communities—AI, statistics, neural networks, computational learning theory—were converging on similar insights. Mitchell's 1997 textbook synthesized this work, providing the T-P-E definition that could encompass all these approaches under one framework.
2000s-Present: Deep Learning and Scale
The deep learning revolution hasn't changed the fundamental definition—it has changed what's achievable within it. Massive hypothesis spaces (billion-parameter neural networks) combined with massive experience (internet-scale data) have shifted what tasks can be solved. But the basic framework—learn from data, measure performance on unseen examples—remains.
Understanding this history isn't mere trivia. The concepts we'll study—VC dimension, PAC bounds, bias-variance tradeoff—came from specific intellectual struggles. Knowing why these ideas were needed helps you understand when to apply them and why they matter.
We've established the formal foundations that will underpin everything in this curriculum. Let's consolidate: Mitchell's definition makes learning measurable (performance P on task T improves with experience E); the formal framework (input space, output space, target function, training set, hypothesis space, loss, and risk) gives us a precise vocabulary; and the gap between empirical risk and true risk frames the generalization question at the heart of the field.
What's next:
With the formal definition established, we turn to the more intuitive but equally important question: what does it mean to 'learn from data'? The next page explores this concept in depth—examining how examples become generalizable knowledge, why this process works at all, and what assumptions make it possible.
You now possess the formal vocabulary and conceptual framework for machine learning. When someone says 'the model learned,' you can ask precise questions: What was the task? How was performance measured? What data did it learn from? This precision transforms vague discussions into actionable analysis.