Consider an extraordinary feat that seems almost magical: after seeing perhaps 1,000 images of cats and 1,000 images of dogs, a machine learning model can correctly classify a photograph of a cat it has never seen before. Not just any cat—a cat in a pose never observed, under lighting conditions never encountered, at an angle never trained on.
How is this possible?
The training set is finite. The space of possible cat images is effectively infinite—every pixel can take thousands of values, and even small images have millions of pixels. The model has observed an infinitesimally small fraction of all possible cats. Yet it generalizes.
This is the miracle at the heart of machine learning. It's not magic, of course—it works because of deep mathematical principles and assumptions about the structure of the world. Understanding these principles is essential for anyone who wants to do more than treat ML as a black box.
This page explores why learning from data works at all. You'll understand the role of inductive bias, the assumptions that make generalization possible, how sample complexity connects data quantity to learning quality, and why the structure of data enables machines to discover patterns that humans might miss.
The philosophical problem underlying machine learning is ancient: the problem of induction.
David Hume articulated it in the 18th century: how can we justify reasoning from observed instances to general conclusions? Having seen the sun rise every morning of your life doesn't logically guarantee it will rise tomorrow. Yet we act as if it will.
Machine learning faces the same challenge in precise form. Having observed that 1,000 emails with the word 'lottery' were spam doesn't logically guarantee the next email with 'lottery' is spam. Yet we train classifiers that act as if it does.
Why does this work?
The answer lies not in logic but in probability and structure. We don't claim certainty—we claim that, under certain assumptions about the data-generating process, our predictions are likely to be correct. This shift from deductive certainty to probabilistic confidence is fundamental to understanding ML.
The 'No Free Lunch' theorem (Wolpert, 1996) formalizes this: no learning algorithm is universally better than any other across all possible problems. For any algorithm that does well on some problem class, there's a problem class where it performs no better than random guessing. The induction problem cannot be solved in general—it can only be approximately solved for structured problems under specific assumptions.
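The No Free Lunch result can be demonstrated in miniature. The sketch below (a toy setup, not Wolpert's general proof) enumerates every possible target function on a four-point domain, trains two different learners on three points, and tests on the fourth. Averaged over all possible problems, both learners score exactly 50%:

```python
from itertools import product

# A "problem" is any labeling f: {0,1,2,3} -> {0,1}; there are 16 of them.
# Each learner sees the labels of inputs 0-2 and must predict input 3.

def majority_learner(train_labels):
    # Predict the majority training label (ties go to 1).
    return 1 if sum(train_labels) * 2 >= len(train_labels) else 0

def constant_learner(train_labels):
    return 0  # Ignores the training data entirely.

for learner in (majority_learner, constant_learner):
    correct = sum(learner(f[:3]) == f[3] for f in product([0, 1], repeat=4))
    print(learner.__name__, correct / 16)  # → 0.5 for both
```

For every way of labeling the three training points, the held-out label is 0 in one completion and 1 in another, so no learner can beat chance when all targets are equally likely. Learning only beats chance once we assume some targets are more likely than others.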
What saves machine learning from the induction problem is not philosophical argument but mathematical characterization. Statistical learning theory answers a precise version of the question:
Given n training examples from an unknown distribution P, when can we guarantee that a learned hypothesis will perform well on new examples from P?
The answer involves bounding the gap between a hypothesis's performance on the training sample and its performance on the underlying distribution, in terms of the sample size and the complexity of the hypothesis space.
We don't solve the induction problem philosophically—we characterize mathematically when induction is reliable and when it fails.
The key insight that enables learning from finite data is inductive bias—assumptions about the nature of the target function that guide the learner toward certain hypotheses over others.
Without inductive bias, learning is impossible. Consider why:
Suppose you observe these training examples for a function f: ℕ → ℕ: f(1) = 2, f(2) = 4, f(3) = 6, f(4) = 8.

What is f(5)?
You probably answered 10, assuming f(x) = 2x. But nothing in the data logically requires this. The hypothesis f(x) = 2x + (x − 1)(x − 2)(x − 3)(x − 4) fits every training example equally well, yet predicts f(5) = 34—and it is only one of infinitely many alternatives.
Infinitely many functions match the training data exactly. Something beyond the data must guide our choice.
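The ambiguity is easy to verify directly. This sketch compares the "obvious" hypothesis f(x) = 2x with one alternative that adds a polynomial vanishing at every training input (f2 here is just one illustrative choice):

```python
# Two hypotheses that agree on the training points (1,2), (2,4), (3,6), (4,8)
# but disagree on the query x = 5.

def f1(x):
    return 2 * x

def f2(x):
    # 2x plus a term that is zero at every observed input.
    return 2 * x + (x - 1) * (x - 2) * (x - 3) * (x - 4)

for x in (1, 2, 3, 4):
    assert f1(x) == f2(x)  # indistinguishable on the training data

print(f1(5), f2(5))  # → 10 34
```

Both hypotheses achieve zero training error; only an inductive bias (here, a preference for simpler functions) selects between them.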
Inductive bias takes several forms:
Restriction Bias — Limiting the hypothesis space to certain function families.
Preference Bias — Among hypotheses that fit data equally well, preferring some over others.
Representation Bias — The features used to represent inputs embody assumptions.
| Algorithm | Inductive Bias | Works Well When |
|---|---|---|
| Linear Regression | True function is linear in features | Relationship is approximately linear, features are well-chosen |
| k-Nearest Neighbors | Similar inputs have similar outputs (smoothness) | Classes are locally clustered, meaningful distance metric exists |
| Decision Trees | Data can be split by axis-aligned thresholds | Features have natural orderings with meaningful cutoffs |
| Neural Networks | Hierarchical composition of simple functions | Problem has compositional structure, sufficient data available |
| Convolutional Networks | Translation invariance and local patterns | Data has spatial/temporal structure where position is secondary |
| Gaussian Processes | Function varies smoothly as specified by kernel | Prior knowledge about smoothness/periodicity can be encoded |
In everyday language, 'bias' has negative connotations. In machine learning, inductive bias is essential and positive—it's what makes learning possible. The challenge is choosing a bias that matches the true structure of your problem. The wrong bias leads to poor generalization; the right bias enables learning from minimal data.
For learning to be possible, we must assume something about how data is generated. The standard assumption in machine learning is the i.i.d. assumption:
Training examples (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) are drawn independently and identically distributed (i.i.d.) from an unknown but fixed probability distribution P(X, Y).
Let's unpack what this means and why it matters:
Every example comes from the same distribution P. The process that generated the 1st training example is the same process that generated the 1,000th, so patterns learned from the training data carry over to future data drawn from P.
When this fails: If you train a spam classifier on 2010 emails and deploy it in 2024, the distribution has changed. Words that indicated spam in 2010 may not indicate spam in 2024. The identical distribution assumption breaks down.
Each example is drawn independently of the others. Knowing one example gives no additional information about another, so every example contributes genuinely new evidence about P.
When this fails: Consider time series data—yesterday's stock price predicts today's. Or consider user sessions—sequential clicks are highly dependent. Independence assumptions lead to overconfident models when dependencies exist.
The i.i.d. assumption is powerful because it enables concentration inequalities—theorems that bound how much a sample average can deviate from the true expectation.
Intuitively: if examples are i.i.d., the sample mean converges to the true mean as n increases (Law of Large Numbers). More precisely, we can quantify how close the sample mean is likely to be for any finite n (Central Limit Theorem, Hoeffding's inequality, etc.).
For machine learning: this means the empirical risk R̂(h) converges to the true risk R(h) as we get more data. With enough i.i.d. training examples, training performance predicts test performance.
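Concentration is easy to observe in simulation. The sketch below (using a Bernoulli "loss" with made-up parameter p = 0.3) shows the sample mean closing in on the true mean as n grows:

```python
import random

def empirical_mean(n, p=0.3, seed=0):
    """Sample mean of n i.i.d. Bernoulli(p) draws."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n)) / n

# The deviation from the true mean shrinks as n grows, as the law of
# large numbers and Hoeffding's inequality predict (roughly like 1/sqrt(n)).
for n in (10, 100, 10_000):
    print(n, round(abs(empirical_mean(n) - 0.3), 4))
```

The same mechanism is what lets the empirical risk R̂(h), itself an average of per-example losses, track the true risk R(h).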
In practice, the i.i.d. assumption is frequently violated. Users have behavior patterns that create dependencies. Data distributions shift over time. Training data is often not representative of deployment conditions. Recognizing these violations—and using techniques designed for non-i.i.d. settings (time series models, domain adaptation, etc.)—is part of mature ML practice.
One of the most practical questions in machine learning is: how much training data do I need?
Sample complexity theory provides principled answers. The core insight is that the amount of data needed depends on three things: the complexity of the hypothesis space, the desired accuracy ε, and the desired confidence 1 − δ.
The Probably Approximately Correct (PAC) framework (Valiant, 1984) formalizes this:
A hypothesis class is PAC-learnable if there exists a learning algorithm that, for any ε > 0 and δ > 0, given enough training examples, outputs a hypothesis h with:
P(R(h) − R(h*) ≤ ε) ≥ 1 − δ
where R(h) is the true risk of h and R(h*) is the optimal risk in the hypothesis class.
In words: with high probability (at least 1 - δ), the learned hypothesis is approximately optimal (within ε of the best possible).
For a hypothesis space H with VC dimension d, the sample complexity (number of examples needed for PAC learning) is:
n = O(d/ε² × log(1/δ))
This tells us several important things:
More complex hypothesis spaces need more data. A hypothesis space with VC dimension 100 needs roughly 10x more data than one with VC dimension 10 to achieve the same accuracy.
Higher accuracy requires quadratically more data. To halve your error (ε), you need 4x more data. To get 10x lower error, you need 100x more data.
Confidence comes relatively cheap. To go from 90% confidence to 99% confidence requires only about 2× more data (since log(1/0.01) / log(1/0.1) = 2).
Dimension, not just parameters, matters. A neural network may have millions of parameters but the effective complexity depends on the architecture, regularization, and how parameters interact—not just their count.
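These scaling relationships follow directly from the bound. The sketch below plugs in numbers (the leading constant c is unknown in practice, so only ratios are meaningful):

```python
import math

def sample_complexity(d, eps, delta, c=1.0):
    # n = O(d/eps^2 * log(1/delta)); c stands in for the bound's hidden constant.
    return c * d / eps**2 * math.log(1 / delta)

base = sample_complexity(d=10, eps=0.1, delta=0.1)

# Halving eps quadruples the data requirement ...
print(round(sample_complexity(10, 0.05, 0.1) / base, 4))   # → 4.0
# ... while tightening delta from 0.1 to 0.01 only doubles it.
print(round(sample_complexity(10, 0.1, 0.01) / base, 4))   # → 2.0
```

The constant c (and hence any absolute n) is not identified by the theory; what the bound pins down is how n must grow as d, ε, and δ change.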
| Scenario | Sample Complexity Considerations | Practical Implication |
|---|---|---|
| Simple hypothesis class (linear) | VC dim ≈ number of features + 1 | Need 10-100× more examples than features |
| Deep neural networks | Effective complexity << parameter count | Regularization and architecture critically affect data needs |
| High-dimensional input | Curse of dimensionality applies | Either need exponential data or strong inductive bias |
| Rare classes (imbalanced data) | Need enough examples per class | Class-specific sample complexity, not just total |
| Transfer learning / pre-training | Leverages prior knowledge | Dramatically reduces sample complexity for new tasks |
Deep neural networks have millions of parameters, suggesting enormous sample complexity. Yet they often generalize well with 'only' millions of examples—far less than classical bounds would require. Understanding why is an active research area. Answers involve implicit regularization from SGD, over-parameterized interpolation, neural tangent kernels, and the geometry of loss landscapes.
Learning from finite data would be impossible if the world were random. What makes machine learning work is that real-world data has structure.
High-dimensional data (like images or text) often lies on or near a much lower-dimensional manifold embedded in the high-dimensional space.
Consider images: A 256×256 RGB image has 256 × 256 × 3 ≈ 200,000 dimensions. If each pixel were independent, we'd need astronomical amounts of data to learn anything. But real images have massive structure: neighboring pixels are highly correlated, intensities vary smoothly within regions, and a small vocabulary of edges, textures, and shapes recurs everywhere.
This structure means that the 'true' dimensionality of natural images is vastly lower—perhaps hundreds or thousands of dimensions, not hundreds of thousands. Learning happens on this low-dimensional manifold.
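The manifold idea can be checked numerically: when high-dimensional points are generated from a few latent factors, the singular value spectrum exposes the low intrinsic dimension. A sketch with synthetic data (the dimensions 5 and 200 and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 points that truly vary in only 5 latent dimensions,
# linearly embedded in a 200-dimensional ambient space.
latent = rng.normal(size=(1000, 5))
embedding = rng.normal(size=(5, 200))
X = latent @ embedding

# The singular values reveal the low intrinsic dimensionality:
# the top 5 directions carry essentially all the variance.
s = np.linalg.svd(X, compute_uv=False)
explained = (s**2)[:5].sum() / (s**2).sum()
print(round(explained, 6))  # → 1.0
```

Real image manifolds are curved rather than linear, so the spectrum decays gradually instead of cutting off, but the same diagnostic (PCA/SVD) is a standard first look at effective dimensionality.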
Smoothness: Similar inputs tend to produce similar outputs. This is the most basic structure—if the function is smooth, knowing f(x) tells you something about f(x + ε) for small ε. Most learning algorithms implicitly or explicitly assume smoothness.
Sparsity: Most features are irrelevant for most predictions. A document classifier might use only a handful of words out of thousands. Sparsity means the effective dimensionality is much lower than the apparent dimensionality.
Compositionality: Complex patterns are built from simpler patterns. Faces are composed of eyes, noses, mouths; words are composed of letters; sentences are composed of words. This hierarchical structure is why deep networks are effective—each layer captures increasingly abstract compositions.
Symmetry and Invariance: Many transformations don't change the relevant properties. Rotating a dog image doesn't change that it's a dog. Shifting a sound wave in time doesn't change what was said. Encoding these invariances (through convolutions, data augmentation, etc.) dramatically reduces what must be learned.
Clustering: Data often clusters into groups with similar properties. Natural categories (species, genres, topics) reflect this clustering. If structure separates clusters, classification becomes tractable.
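The invariance idea above can be exercised directly: data augmentation manufactures new labeled examples from transformations known not to change the label. A minimal sketch (the image array and label here are stand-ins, not real data):

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.random((28, 28))  # stand-in for a training image
label = 7

def augment(img, lab):
    # Shifts and mirror flips don't change what the image depicts,
    # so each transform yields a "new" example with the same label.
    yield img, lab
    yield np.roll(img, shift=2, axis=1), lab   # shift 2 pixels right
    yield np.fliplr(img), lab                  # mirror horizontally

augmented = list(augment(image, label))
print(len(augmented))  # → 3 examples from 1, all labeled 7
```

Whether a transform is label-preserving is domain knowledge: a horizontal flip is safe for dogs but not for digits ('2' flipped is not a '2'), which is exactly the sense in which augmentation encodes inductive bias.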
Machine learning succeeds when the true structure of the problem matches the inductive bias of the algorithm. Deep learning succeeds on vision because images have hierarchical compositional structure that matches convolutional architectures. Understanding the structure of your problem is essential for choosing effective methods.
A critical insight of modern machine learning is that how you represent data determines what you can learn.
A representation is a transformation of raw data into a form suitable for learning. It maps raw inputs (pixels, words, sensor readings) into a feature space where the task becomes tractable.
Example: Digit Recognition
Raw representation: 28×28 = 784 pixel values.
Naive learning: compare each of the 784 pixels independently. This treats the image as an unstructured vector, ignoring spatial relationships. Every slight shift or rotation produces a dramatically different vector.
Learned representation: a neural network transforms pixels into abstract features—strokes, loops, angles—that capture what makes a '7' a '7' regardless of exact positioning. The learned representation is invariant to irrelevant variations.
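The shift-sensitivity of the raw pixel representation is easy to demonstrate. In this sketch the "digit" is a random stand-in image; a one-pixel shift leaves the content intact but moves the raw vector far away:

```python
import numpy as np

rng = np.random.default_rng(0)
digit = (rng.random((28, 28)) > 0.8).astype(float)  # stand-in digit image

shifted = np.roll(digit, shift=1, axis=1)  # same "digit", moved 1 pixel right

# As raw 784-dimensional vectors the two images are far apart ...
raw_distance = np.linalg.norm(digit - shifted)

# ... but a shift-invariant summary (here, total "ink") is unchanged.
print(raw_distance > 0, digit.sum() == shifted.sum())  # → True True
```

A good learned representation behaves like the invariant summary, not like the raw vector: nearby poses of the same digit map to nearby feature vectors.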
Example: Text Classification
Raw representation: character sequences.
Bag-of-words representation: count of each word, ignoring order. Simple but loses 'not good' vs. 'good' distinction.
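The order-blindness is concrete: two sentences with opposite meanings can produce identical bags of words. A minimal sketch:

```python
from collections import Counter

def bag_of_words(text):
    # Count each token, discarding order entirely.
    return Counter(text.lower().split())

a = bag_of_words("not good , actually bad")
b = bag_of_words("not bad , actually good")
print(a == b)  # → True: the representations are identical
```

Any classifier operating on this representation is forced to treat the two reviews identically, no matter how much training data it sees.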
Word embeddings: map each word to a dense vector where similar words have similar vectors. 'excellent' and 'great' become neighbors.
Contextual embeddings (BERT, GPT): each word's representation depends on its context. 'bank' near 'river' differs from 'bank' near 'money.'
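The geometry behind embeddings can be illustrated with cosine similarity. The vectors below are hand-made toys for illustration; real embeddings are learned and have hundreds of dimensions:

```python
import math

# Toy 3-d "embeddings" (invented for illustration only).
embeddings = {
    "excellent": [0.9, 0.8, 0.1],
    "great":     [0.85, 0.75, 0.15],
    "terrible":  [-0.8, -0.9, 0.2],
}

def cosine(u, v):
    # Cosine of the angle between two vectors: 1 = aligned, -1 = opposite.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(cosine(embeddings["excellent"], embeddings["great"]))     # high, near 1
print(cosine(embeddings["excellent"], embeddings["terrible"]))  # negative
```

In a learned space, "similar words have similar vectors" means exactly this: synonyms have high cosine similarity, and that similarity is something a downstream classifier can exploit.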
Historically, human experts designed features based on domain knowledge: edge detectors and SIFT descriptors for images, MFCC coefficients for speech, TF-IDF weights for text.
This feature engineering was labor-intensive and limited by human insight.
Representation learning changed this paradigm. Instead of relying on human-designed features, the learning algorithm discovers useful features directly from data.
The revolution of deep learning is largely a revolution in representation learning. Deep networks construct hierarchies of increasingly abstract representations—from pixels to edges to textures to object parts to objects to scenes.
A good representation makes downstream tasks easy. If you have the right representation, even a linear classifier can achieve high accuracy. Much of modern ML research focuses on learning better representations—through self-supervised learning, contrastive learning, or foundation models—that transfer across tasks.
One of the most robust findings in machine learning is the scaling law: performance often improves predictably as training data increases.
For many tasks, error decreases according to a power law:
Error ≈ C × n^(-α)
where n is the number of training examples, C is a constant, and α is the scaling exponent (typically between 0.1 and 0.5).
This means improvements are predictable but progressively more expensive: with α = 0.5, quadrupling the data halves the error, and each further halving costs another 4× increase.
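The power law is simple enough to compute directly. The sketch below uses illustrative values C = 1 and α = 0.5 (real constants must be fit empirically for each task):

```python
def error(n, C=1.0, alpha=0.5):
    # Power-law scaling: error ≈ C * n^(-alpha).
    return C * n ** -alpha

# With alpha = 0.5, each 4x increase in data halves the error.
for n in (1_000, 4_000, 16_000):
    print(n, round(error(n), 5))
```

Fitting C and α on subsets of an existing dataset, then extrapolating, is a common way to estimate whether collecting more data will pay off.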
Implications for practice: measure your own scaling curve on subsets of the data you already have, then extrapolate to decide whether collecting more data is worth the cost.
Successful ML products often create a data flywheel: a better model attracts more users, more users generate more data, and more data trains a still better model.
Example: Google Search. More searches produce more query and click data, which improves ranking, which in turn attracts more searches.
This flywheel creates winner-take-all dynamics. The company with the most data builds the best model, attracting more users, generating more data, widening the gap.
More data isn't always better. Noisy labels, biased sampling, and distribution mismatch can make additional data harmful. The right 10,000 carefully curated examples may outperform 10 million noisy ones. Data quality is as important as quantity—sometimes more so.
We've explored the deep questions underlying how machines can learn from finite experience. The key insights: generalization requires inductive bias; the i.i.d. assumption licenses inference from sample to distribution; sample complexity ties data requirements to hypothesis-space complexity and desired accuracy; real-world structure (manifolds, smoothness, compositionality, invariance) makes learning tractable; representations determine what can be learned; and performance scales predictably with data.
What's next:
Having understood what learning is (formal definition) and how learning from data works (this page), we now contrast machine learning with traditional programming. The next page explores why the ML paradigm exists—what problems demand learning over explicit programming—and how the two approaches fundamentally differ in philosophy and practice.
You now understand the theoretical foundations of learning from data. This isn't abstract philosophy—it directly informs practical decisions: how much data to collect, what algorithms to try, how to diagnose failures, and when to expect improvements from scale. Every ML system rests on these principles.