When we say a machine learning algorithm 'learns' a model, what does that actually mean? What are the possible things it could learn? How does it decide which one to pick?
These questions lead us to one of the most fundamental concepts in machine learning theory: the hypothesis space.
The hypothesis space is the set of all possible functions that a learning algorithm is allowed to consider. It's the universe within which learning takes place. Understanding hypothesis spaces illuminates what an algorithm can and cannot represent, why restricting the space often improves generalization, and how inductive bias shapes what is learned.
This is where machine learning moves from practical engineering to rigorous mathematics—and understanding this transition is essential for developing principled intuition about ML.
By the end of this page, you will understand:
• The formal definition of hypothesis space
• How different algorithms define different hypothesis spaces
• The relationship between hypothesis space size and generalization
• The bias-variance perspective on hypothesis selection
• Why restricting the hypothesis space is often beneficial
Let's establish the formal vocabulary.
Definition: Hypothesis
A hypothesis $h$ is a function that maps inputs to outputs:
$$h: \mathcal{X} \rightarrow \mathcal{Y}$$
where $\mathcal{X}$ is the input space (feature space) and $\mathcal{Y}$ is the output space (label space).
For classification: $h(\mathbf{x})$ predicts a class label.
For regression: $h(\mathbf{x})$ predicts a real value.
Definition: Hypothesis Space (or Hypothesis Class)
The hypothesis space $\mathcal{H}$ is the set of all hypotheses that the learning algorithm considers:
$$\mathcal{H} = \{h : \mathcal{X} \rightarrow \mathcal{Y}\}$$
This set is determined by the choice of algorithm and any constraints we impose. The learning problem becomes: find the hypothesis $h^* \in \mathcal{H}$ that best fits the training data and generalizes well.
Learning cannot find a solution outside the hypothesis space. If the true underlying function (the one that generated the data) is not contained in ℋ, the best we can do is find the closest approximation within ℋ. The choice of ℋ fundamentally limits what learning can achieve.
Example: Linear Classifiers in 2D
Consider binary classification with 2D inputs $\mathbf{x} = [x_1, x_2]^T$.
A linear classifier hypothesis takes the form:
$$h_{\mathbf{w},b}(\mathbf{x}) = \text{sign}(w_1 x_1 + w_2 x_2 + b) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$$
The hypothesis space of all linear classifiers in 2D is:
$$\mathcal{H}_{\text{linear}} = \{h_{\mathbf{w},b} : \mathbf{w} \in \mathbb{R}^2, b \in \mathbb{R}\}$$
This is parameterized by 3 real numbers: $w_1$, $w_2$, and $b$. Each setting of these parameters defines a different hypothesis (a different line in 2D that separates classes).
What's NOT in This Hypothesis Space?
If the true pattern requires a circular boundary, no linear classifier will ever find it—it's outside the hypothesis space.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_circles, make_blobs

# Create two datasets: one linearly separable, one not
np.random.seed(42)

# Linearly separable data
X_linear, y_linear = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=42)

# Non-linearly separable data (circles inside circles)
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, X, y, title in [(axes[0], X_linear, y_linear, 'Linearly Separable'),
                        (axes[1], X_circles, y_circles, 'Not Linearly Separable (Circles)')]:
    # Fit linear classifier
    clf = LogisticRegression()
    clf.fit(X, y)

    # Plot decision boundary
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 100),
                         np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 100))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', edgecolors='black')
    ax.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', edgecolors='black')
    ax.set_title(f'{title}\nAccuracy: {clf.score(X, y):.2f}')
    ax.legend()

plt.suptitle('The Hypothesis Space of Linear Classifiers', fontsize=14)
plt.tight_layout()
plt.show()

# Key insight: Even with perfect optimization, the circles dataset
# cannot be classified well because the TRUE boundary is circular,
# which is OUTSIDE the linear hypothesis space
```

Different machine learning algorithms define different hypothesis spaces. Understanding what each algorithm can and cannot represent is crucial for algorithm selection.
| Algorithm | Hypothesis Space $\mathcal{H}$ | Size/Complexity | Can Represent |
|---|---|---|---|
| Linear Regression | $\{\mathbf{w}^T\mathbf{x} + b : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$ | $d + 1$ parameters (finite-dimensional) | Linear relationships only |
| Polynomial Regression (degree $p$) | $\{\sum_{i=0}^{p} w_i x^i : \mathbf{w} \in \mathbb{R}^{p+1}\}$ | $p + 1$ parameters | Polynomial curves up to degree $p$ |
| Decision Tree (depth d) | All trees with depth ≤ $d$ | At most $2^d$ leaves | Axis-aligned rectangular partitions |
| k-NN (k neighbors) | Implicitly defined by training data | Non-parametric (grows with data) | Arbitrary complex boundaries (if k is small) |
| SVM with Linear Kernel | Linear separating hyperplanes | $d + 1$ parameters | Linearly separable patterns |
| SVM with RBF Kernel | Functions in infinite-dimensional RKHS | Infinite-dimensional (but regularized) | Very complex, smooth boundaries |
| Neural Network (fixed architecture) | $\{f_\theta : \theta \in \mathbb{R}^p\}$ | $p$ parameters (can be millions) | Universal approximators (with enough capacity) |
Parametric vs Non-Parametric Models:
Parametric models have a fixed-size hypothesis space regardless of training data size:
- Advantage: efficient inference; the model does not grow with the data
- Disadvantage: may not capture complex patterns

Non-parametric models have hypothesis complexity that grows with data:
- Advantage: can capture arbitrarily complex patterns
- Disadvantage: complexity grows with the dataset and can overfit easily
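To make the contrast concrete, here is a minimal sketch (assuming scikit-learn; the `n_samples_fit_` attribute used below is exposed by recent scikit-learn versions of the k-NN estimators). The linear model's parameter count is fixed by the feature dimension, while the fitted k-NN model must retain every training point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

for n in [50, 500, 5000]:
    # One input feature, so the linear hypothesis space always has 2 parameters
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X.ravel()) + 0.1 * rng.standard_normal(n)

    # Parametric: hypothesis space fixed in advance (slope + intercept)
    lin = LinearRegression().fit(X, y)
    n_params = lin.coef_.size + 1

    # Non-parametric: the fitted model stores the entire training set
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    n_stored = knn.n_samples_fit_

    print(f"n = {n:5d} | linear parameters: {n_params} | k-NN stored samples: {n_stored}")
```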
Neural networks with at least one hidden layer and nonlinear activation can approximate any continuous function to arbitrary precision—given enough hidden units. This means neural network hypothesis spaces can be made arbitrarily expressive. However, 'can in principle' and 'can in practice' are very different. Training challenges, finite data, and computational limits mean we rarely achieve this theoretical expressiveness.
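As a rough illustration of this "in principle" versus "in practice" gap, the sketch below (assuming scikit-learn's `MLPRegressor`; exact numbers will vary by version and run) fits one-hidden-layer networks of increasing width to a fixed continuous target. Wider networks define a more expressive hypothesis space, but the optimizer still has to find a good hypothesis inside it:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

np.random.seed(0)
X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * np.pi * X.ravel())  # noise-free target: error reflects capacity + optimization

for width in [2, 8, 32, 128]:
    mlp = MLPRegressor(hidden_layer_sizes=(width,), activation='tanh',
                       solver='lbfgs', max_iter=5000, random_state=0)
    mlp.fit(X, y)
    mse = mean_squared_error(y, mlp.predict(X))
    print(f"hidden units: {width:4d} | training MSE: {mse:.4f}")

# The trend (not the exact values) is the point: more hidden units widen the
# hypothesis space, but training may still stop short of the best hypothesis in it.
```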
Learning can be viewed as a search problem: given training data $\mathcal{D}$, find the hypothesis $h^* \in \mathcal{H}$ that best explains the data and will generalize well.
The Learning Objective:
Define a loss function $L(h, \mathcal{D})$ that measures how poorly hypothesis $h$ fits the data. Learning seeks:
$$h^* = \arg\min_{h \in \mathcal{H}} L(h, \mathcal{D})$$
How the Search Works Depends on $\mathcal{H}$:
Finite Hypothesis Spaces: Can enumerate all possibilities (though often infeasible)
Continuous, Convex Hypothesis Spaces: Gradient descent finds global optimum
Continuous, Non-Convex Hypothesis Spaces: Local optimization, hope for good local minima
Discrete, Combinatorial Spaces: Heuristic search, approximations
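For a small, hypothetical illustration of the finite case, the sketch below enumerates a hypothesis space of 101 decision stumps $h_t(x) = \mathbb{1}[x > t]$ on a 1D problem and simply picks the one with the lowest training error. The continuous case is illustrated by the listing that follows it.

```python
import numpy as np

# A tiny finite hypothesis space: threshold classifiers h_t(x) = 1[x > t]
# with t drawn from a fixed grid. Learning = exhaustive search over H.
np.random.seed(0)
X = np.sort(np.random.uniform(0, 10, 40))
y = (X > 6.3).astype(int)                 # true boundary at x = 6.3
flip = np.random.rand(40) < 0.1           # corrupt 10% of labels with noise
y = np.where(flip, 1 - y, y)

thresholds = np.linspace(0, 10, 101)      # |H| = 101 hypotheses

def training_error(t):
    return np.mean((X > t).astype(int) != y)

errors = [training_error(t) for t in thresholds]
best_t = thresholds[int(np.argmin(errors))]
print(f"Enumerated {len(thresholds)} hypotheses")
print(f"Best threshold: {best_t:.1f} (training error {min(errors):.2f})")
```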
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generate data from a quadratic function with noise
np.random.seed(42)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y_true = 0.5 * X.ravel()**2 + 2          # True function: y = 0.5x² + 2
y = y_true + np.random.randn(30) * 0.5   # Add noise

# Search over hypothesis spaces of different polynomial degrees
print("Searching over hypothesis spaces of increasing complexity")
print("=" * 60)

for degree in [0, 1, 2, 3, 5, 10]:
    # Create polynomial features → defines hypothesis space
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)

    # Search within this hypothesis space (closed-form for linear regression)
    model = LinearRegression()
    model.fit(X_poly, y)

    # The selected hypothesis
    y_pred = model.predict(X_poly)
    mse = mean_squared_error(y, y_pred)

    # Number of parameters defines hypothesis space size
    n_params = X_poly.shape[1]

    print(f"\nDegree {degree} polynomials:")
    print(f"  Hypothesis space: all polynomials up to degree {degree}")
    print(f"  Number of parameters: {n_params}")
    print(f"  Training MSE: {mse:.4f}")
    print(f"  Selected hypothesis: p(x) = {model.intercept_:.3f}", end="")
    for i, coef in enumerate(model.coef_[1:], 1):
        if abs(coef) > 0.001:
            print(f" + {coef:.3f}x^{i}", end="")
    print()

# Key insight: As hypothesis space grows, training error decreases
# But at degree 10, we're likely overfitting to noise
```

For most practical hypothesis spaces, we cannot explore every possibility. Gradient descent, greedy algorithms, and other heuristics explore a path through the space, not the whole space. The hypothesis we find depends on initialization, hyperparameters, and luck. Two runs of neural network training with different random seeds can find very different hypotheses—all in the same hypothesis space.
Here's a fundamental tension in machine learning:
Larger hypothesis spaces → more expressive, can fit more patterns.
But also → more risk of fitting noise, worse generalization.
This isn't just intuitive—it's mathematically formalized in learning theory.
The Generalization Bound (Informal):
With probability at least $1 - \delta$, for any hypothesis $h \in \mathcal{H}$:
$$\text{True Error}(h) \leq \text{Training Error}(h) + \text{Complexity Term}(|\mathcal{H}|, n, \delta)$$
where:
- True Error$(h)$ is the expected error on new data drawn from the underlying distribution
- Training Error$(h)$ is the average error on the $n$ training examples
- $n$ is the number of training examples
- $\delta$ is the allowed failure probability of the guarantee

The complexity term depends on measures like:
- the size $|\mathcal{H}|$ of a finite hypothesis space
- the VC dimension of $\mathcal{H}$
- the Rademacher complexity of $\mathcal{H}$
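For the special case of a finite hypothesis space, one standard form of this bound (Hoeffding's inequality plus a union bound) puts the complexity term at $\sqrt{(\ln|\mathcal{H}| + \ln(1/\delta)) / (2n)}$. The short sketch below simply evaluates that term to show how it scales:

```python
import numpy as np

# Complexity term for a finite hypothesis space (Hoeffding + union bound):
#   True Error(h) <= Training Error(h) + sqrt((ln|H| + ln(1/delta)) / (2n))
def complexity_term(H_size, n, delta=0.05):
    return np.sqrt((np.log(H_size) + np.log(1 / delta)) / (2 * n))

for n in [100, 1_000, 10_000]:
    for H_size in [10, 1_000, 1_000_000]:
        gap = complexity_term(H_size, n)
        print(f"n = {n:6d} | |H| = {H_size:9d} | complexity term = {gap:.3f}")

# The bound loosens only logarithmically in |H| but tightens as 1/sqrt(n):
# a larger hypothesis space demands more data for the same guarantee.
```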
Implications:
Simpler hypothesis spaces generalize better (all else equal)
More training data allows larger hypothesis spaces
Regularization shrinks the effective hypothesis space
The right hypothesis space matches the problem complexity
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate data
np.random.seed(42)
n = 30
X = np.linspace(0, 1, n).reshape(-1, 1)
y_true = np.sin(2 * np.pi * X.ravel())
y = y_true + 0.3 * np.random.randn(n)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try different hypothesis space sizes (polynomial degrees)
degrees = range(0, 15)
train_errors = []
test_errors = []

for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    train_errors.append(mean_squared_error(y_train, model.predict(X_train_poly)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test_poly)))

# Plot the relationship between hypothesis space complexity and error
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'b-o', label='Training Error', markersize=8)
plt.plot(degrees, test_errors, 'r-o', label='Test Error', markersize=8)
plt.axvline(x=3, color='green', linestyle='--', label='Optimal Complexity', alpha=0.7)

plt.fill_between([0, 2], [0]*2, [1]*2, alpha=0.1, color='blue', label='Underfitting Zone')
plt.fill_between([6, 14], [0]*2, [1]*2, alpha=0.1, color='red', label='Overfitting Zone')

plt.xlabel('Polynomial Degree (Hypothesis Space Complexity)')
plt.ylabel('Mean Squared Error')
plt.title('Hypothesis Space Size vs. Generalization Error')
plt.legend()
plt.yscale('log')
plt.ylim(0.001, 10)
plt.grid(True, alpha=0.3)
plt.show()

print("Key observations:")
print("- Training error monotonically decreases with complexity")
print("- Test error decreases, then increases (overfitting)")
print("- Sweet spot: degree 3-4 (matches true sine-like pattern)")
```

Every learning algorithm embodies inductive biases—assumptions about what kinds of hypotheses are preferred. Without these biases, learning would be impossible.
Why Bias is Necessary:
Consider a classification problem with 10 training examples. There are infinitely many hypotheses consistent with these 10 points: a simple linear boundary, a wildly convoluted surface that threads between every point, a lookup table that memorizes the examples and predicts arbitrarily everywhere else, and countless more.
All have zero training error. How do we choose? We need preferences—biases—about which hypotheses are more likely to generalize.
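A quick sketch of the problem (a hypothetical setup, assuming scikit-learn): two very different learners both fit ten training points perfectly, yet disagree on a noticeable fraction of unseen inputs. Training error alone cannot tell them apart.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)
X_train = np.random.uniform(-1, 1, size=(10, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

models = {
    '1-NN (memorizes the points)': KNeighborsClassifier(n_neighbors=1),
    'Unrestricted decision tree':  DecisionTreeClassifier(random_state=0),
}

X_new = np.random.uniform(-1, 1, size=(1000, 2))   # unseen inputs
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: training accuracy = {model.score(X_train, y_train):.2f}")
    predictions[name] = model.predict(X_new)

p1, p2 = predictions.values()
print(f"Disagreement on unseen inputs: {np.mean(p1 != p2):.0%}")
```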
Common Inductive Biases:
- Smoothness: similar inputs should receive similar outputs
- Simplicity (Occam's razor): prefer hypotheses with fewer parameters or shorter descriptions
- Sparsity: only a few features really matter
- Locality and translation invariance: nearby structure matters, and the same pattern can appear anywhere (the bias built into CNNs)
- Large margins: prefer decision boundaries far from the training points (the bias built into SVMs)
Matching Bias to Problem:
The art of machine learning includes choosing algorithms whose inductive biases match the problem structure:
| Problem Domain | Useful Bias | Suggested Algorithms |
|---|---|---|
| Image recognition | Spatial hierarchy, translation invariance | CNNs |
| Tabular data | Feature interactions, nonlinearity | Gradient boosting, random forests |
| Natural language | Sequential dependencies, compositionality | Transformers, RNNs |
| Physical simulations | Smoothness, conservation laws | Physics-informed neural networks |
| Graph-structured data | Permutation invariance, locality | Graph neural networks |
| Time series | Temporal smoothness, periodicity | RNNs, temporal convolutions |
When the inductive bias matches the problem, learning is efficient and generalization is good. When it doesn't, even vast amounts of data may not help.
The No Free Lunch theorem proves that no algorithm is universally better than all others across all possible problems. What makes an algorithm good for one problem makes it worse for another. The goal is matching the algorithm's inductive bias to your specific problem—not finding a 'best' algorithm.
Learning theory distinguishes two settings based on whether the true function is in our hypothesis space.
Realizable Setting:
The true function $f^*$ that generated the data is contained in $\mathcal{H}$:
$$f^* \in \mathcal{H}$$
In this setting:
- Some hypothesis in $\mathcal{H}$ achieves zero true error
- A hypothesis consistent with any training set always exists
- With enough data, learning can drive both training error and true error toward zero
Agnostic Setting:
We make no assumption that $f^* \in \mathcal{H}$:
The best we can hope for is to find the hypothesis $h^*$ that minimizes error among all $h \in \mathcal{H}$:
$$h^* = \arg\min_{h \in \mathcal{H}} \text{True Error}(h)$$
In this setting:
- Zero error may be unachievable no matter how much data we collect
- There is an irreducible approximation error determined by how well $\mathcal{H}$ can approximate $f^*$
- The goal is to get close to the best hypothesis in $\mathcal{H}$, not to the truth itself
Decomposing Error:
In the agnostic setting, the error of a learned hypothesis $\hat{h}$ can be decomposed:
$$\text{Error}(\hat{h}) = \underbrace{\text{Error}(h^*)}_{\text{Approximation Error}} + \underbrace{\left(\text{Error}(\hat{h}) - \text{Error}(h^*)\right)}_{\text{Estimation Error}}$$
where:
- Approximation error is the error of $h^*$, the best hypothesis in $\mathcal{H}$. It is determined by the choice of hypothesis space, not by the data.
- Estimation error is how far the learned hypothesis $\hat{h}$ falls short of $h^*$. It is caused by learning from a finite, noisy sample.
This is the essence of the bias-variance tradeoff viewed through hypothesis spaces: enlarging $\mathcal{H}$ reduces approximation error (bias) but tends to increase estimation error (variance), while shrinking $\mathcal{H}$ does the reverse.
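The sketch below makes the decomposition tangible under an assumed setup: the true function $\sin(2\pi x)$ lies outside the linear hypothesis space, $h^*$ is approximated by fitting on a huge noise-free sample, and $\hat{h}$ is fit on 20 noisy points. The exact numbers depend on the random draw; the split into two error sources is the point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # true function: NOT in the linear hypothesis space

# Approximate h* (the best linear hypothesis) by fitting on a huge noise-free sample
X_big = rng.uniform(0, 1, size=(100_000, 1))
h_star = LinearRegression().fit(X_big, f(X_big.ravel()))

# The learned hypothesis: fit on a small, noisy sample
X_small = rng.uniform(0, 1, size=(20, 1))
y_small = f(X_small.ravel()) + 0.3 * rng.standard_normal(20)
h_hat = LinearRegression().fit(X_small, y_small)

# Use a dense grid as a stand-in for the true error
X_eval = np.linspace(0, 1, 2000).reshape(-1, 1)
approx_error = mean_squared_error(f(X_eval.ravel()), h_star.predict(X_eval))
total_error = mean_squared_error(f(X_eval.ravel()), h_hat.predict(X_eval))

print(f"Approximation error (best linear hypothesis): {approx_error:.3f}")
print(f"Total error of the learned hypothesis:        {total_error:.3f}")
print(f"Estimation error (the difference):            {total_error - approx_error:.3f}")
```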
In practice, we almost never know the true function, so the agnostic setting is more realistic. We choose hypothesis spaces hoping they're expressive enough to capture the essential patterns while remaining constrained enough to enable generalization. It's always a guess—informed by domain knowledge and validated empirically.
Given the tradeoff between expressiveness and generalization, how do we effectively restrict the hypothesis space? Several mechanisms work together:
1. Architecture Choice:
The structure of the model defines $\mathcal{H}$: the degree of a polynomial, the depth of a decision tree, the choice of kernel, and the number of layers and units in a neural network all determine which functions can be represented at all.
2. Regularization:
Penalizing certain hypotheses makes them effectively unreachable: an L2 penalty (ridge) discourages large weights, an L1 penalty (lasso) drives many weights to exactly zero, and early stopping halts the search before it wanders into the most complex regions of the space.
3. Hyperparameter Settings:
Choices such as maximum tree depth, the number of neighbors $k$, kernel width, or the regularization strength $\alpha$ control how much of the hypothesis space the learner can effectively use, as the example below shows.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Generate noisy data
np.random.seed(42)
n = 30
X = np.linspace(0, 1, n).reshape(-1, 1)
y_true = np.sin(4 * np.pi * X.ravel())
y = y_true + 0.5 * np.random.randn(n)

# High-degree polynomial = large hypothesis space
# Without regularization: overfits
# With regularization: restricted to simpler functions

degree = 15
alphas = [0, 0.0001, 0.001, 0.1, 10]

fig, axes = plt.subplots(1, len(alphas), figsize=(20, 4))
X_plot = np.linspace(0, 1, 100).reshape(-1, 1)

for ax, alpha in zip(axes, alphas):
    # Create pipeline: polynomial features + ridge regression
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('ridge', Ridge(alpha=alpha))
    ])
    pipeline.fit(X, y)
    y_pred = pipeline.predict(X_plot)

    # Get coefficient magnitudes
    coef_norm = np.linalg.norm(pipeline.named_steps['ridge'].coef_)

    ax.scatter(X, y, color='blue', s=50, label='Data', alpha=0.6)
    ax.plot(X_plot, y_pred, 'r-', linewidth=2, label='Model')
    ax.plot(X_plot, np.sin(4 * np.pi * X_plot.ravel()), 'g--', linewidth=1, label='True', alpha=0.7)
    ax.set_title(f'α = {alpha}\n||w|| = {coef_norm:.1f}')
    ax.set_ylim(-3, 3)
    ax.legend(fontsize=8)

plt.suptitle('Regularization Restricts the Effective Hypothesis Space', fontsize=14)
plt.tight_layout()
plt.show()

# Key insight: Same hypothesis space (degree-15 polynomials)
# But regularization makes most of that space "expensive" to reach
# Effectively restricting the learner to simpler functions
```

Regularization doesn't change the hypothesis space—it changes the search process. With regularization, hypotheses with large parameters become very 'expensive' to reach, so the optimizer settles for simpler hypotheses that trade a little training fit for much simpler representations.
We've established the theoretical framework for understanding what machine learning algorithms actually search over and how this shapes what they can learn.
What's Next:
We've established what models search over and how to evaluate them. But how do we actually measure 'fit'? The next page introduces loss functions—the mathematical quantification of model error that drives learning.
You now understand the hypothesis space—the set of all possible functions a learning algorithm considers. You can reason about how algorithm choice defines the hypothesis space, why restricting it aids generalization, and how inductive bias shapes learning. Next, we'll explore loss functions.