When we say a machine learning algorithm 'learns' a model, what does that actually mean? What are the possible things it could learn? How does it decide which one to pick?
These questions lead us to one of the most fundamental concepts in machine learning theory: the hypothesis space.
The hypothesis space is the set of all possible functions that a learning algorithm is allowed to consider. It's the universe within which learning takes place. Understanding hypothesis spaces illuminates what an algorithm can and cannot represent, why restricting the space often improves generalization, and how inductive bias shapes what is learned.
This is where machine learning moves from practical engineering to rigorous mathematics—and understanding this transition is essential for developing principled intuition about ML.
By the end of this page, you will understand:
• The formal definition of hypothesis space
• How different algorithms define different hypothesis spaces
• The relationship between hypothesis space size and generalization
• The bias-variance perspective on hypothesis selection
• Why restricting the hypothesis space is often beneficial
Let's establish the formal vocabulary.
Definition: Hypothesis
A hypothesis $h$ is a function that maps inputs to outputs:
$$h: \mathcal{X} \rightarrow \mathcal{Y}$$
where $\mathcal{X}$ is the input space (feature space) and $\mathcal{Y}$ is the output space (label space).
For classification: $h(\mathbf{x})$ predicts a class label.
For regression: $h(\mathbf{x})$ predicts a real value.
Definition: Hypothesis Space (or Hypothesis Class)
The hypothesis space $\mathcal{H}$ is the set of all hypotheses that the learning algorithm considers:
$$\mathcal{H} = \{h : \mathcal{X} \rightarrow \mathcal{Y}\}$$
This set is determined by the choice of algorithm and any constraints we impose. The learning problem becomes: find the hypothesis $h^* \in \mathcal{H}$ that best fits the training data and generalizes well.
Learning cannot find a solution outside the hypothesis space. If the true underlying function (the one that generated the data) is not contained in ℋ, the best we can do is find the closest approximation within ℋ. The choice of ℋ fundamentally limits what learning can achieve.
Example: Linear Classifiers in 2D
Consider binary classification with 2D inputs $\mathbf{x} = [x_1, x_2]^T$.
A linear classifier hypothesis takes the form:
$$h_{\mathbf{w},b}(\mathbf{x}) = \text{sign}(w_1 x_1 + w_2 x_2 + b) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$$
The hypothesis space of all linear classifiers in 2D is:
$$\mathcal{H}_{\text{linear}} = \{h_{\mathbf{w},b} : \mathbf{w} \in \mathbb{R}^2, b \in \mathbb{R}\}$$
This is parameterized by 3 real numbers: $w_1$, $w_2$, and $b$. Each setting of these parameters defines a different hypothesis (a different line in 2D that separates classes).
What's NOT in This Hypothesis Space?
If the true pattern requires a circular boundary, no linear classifier will ever find it—it's outside the hypothesis space.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_circles, make_blobs

# Create two datasets: one linearly separable, one not
np.random.seed(42)

# Linearly separable data
X_linear, y_linear = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=42)

# Non-linearly separable data (circles inside circles)
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, X, y, title in [(axes[0], X_linear, y_linear, 'Linearly Separable'),
                        (axes[1], X_circles, y_circles, 'Not Linearly Separable (Circles)')]:
    # Fit linear classifier
    clf = LogisticRegression()
    clf.fit(X, y)

    # Plot decision boundary
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 100),
                         np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 100))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', edgecolors='black')
    ax.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', edgecolors='black')
    ax.set_title(f'{title}\nAccuracy: {clf.score(X, y):.2f}')
    ax.legend()

plt.suptitle('The Hypothesis Space of Linear Classifiers', fontsize=14)
plt.tight_layout()
plt.show()

# Key insight: Even with perfect optimization, the circles dataset
# cannot be classified well because the TRUE boundary is circular,
# which is OUTSIDE the linear hypothesis space
```

Different machine learning algorithms define different hypothesis spaces. Understanding what each algorithm can and cannot represent is crucial for algorithm selection.
| Algorithm | Hypothesis Space $\mathcal{H}$ | Size/Complexity | Can Represent |
|---|---|---|---|
| Linear Regression | $\{\mathbf{w}^T\mathbf{x} + b : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$ | $d + 1$ parameters (finite-dimensional) | Linear relationships only |
| Polynomial Regression (degree $p$) | $\{\sum_{i=0}^{p} w_i x^i : \mathbf{w} \in \mathbb{R}^{p+1}\}$ | $p + 1$ parameters | Polynomial curves up to degree $p$ |
| Decision Tree (depth d) | All trees with depth ≤ $d$ | At most $2^d$ leaves | Axis-aligned rectangular partitions |
| k-NN (k neighbors) | Implicitly defined by training data | Non-parametric (grows with data) | Arbitrary complex boundaries (if k is small) |
| SVM with Linear Kernel | Linear separating hyperplanes | $d + 1$ parameters | Linearly separable patterns |
| SVM with RBF Kernel | Functions in infinite-dimensional RKHS | Infinite-dimensional (but regularized) | Very complex, smooth boundaries |
| Neural Network (fixed architecture) | $\{f_\theta : \theta \in \mathbb{R}^p\}$ | $p$ parameters (can be millions) | Universal approximators (with enough capacity) |
Parametric vs Non-Parametric Models:
Parametric models have a fixed-size hypothesis space regardless of training data size:
- Advantage: efficient inference; the model does not grow with the data
- Disadvantage: may not capture complex patterns

Non-parametric models have hypothesis complexity that grows with data:
- Advantage: can capture arbitrarily complex patterns
- Disadvantage: complexity grows with the dataset and can overfit easily
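To make the contrast concrete, here is a minimal sketch (assuming scikit-learn; the `n_samples_fit_` attribute used below is exposed by recent scikit-learn versions of the k-NN estimators). The linear model's parameter count is fixed by the feature dimension, while the fitted k-NN model must retain every training point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

for n in [50, 500, 5000]:
    # One input feature, so the linear hypothesis space always has 2 parameters
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X.ravel()) + 0.1 * rng.standard_normal(n)

    # Parametric: hypothesis space fixed in advance (slope + intercept)
    lin = LinearRegression().fit(X, y)
    n_params = lin.coef_.size + 1

    # Non-parametric: the fitted model stores the entire training set
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    n_stored = knn.n_samples_fit_

    print(f"n = {n:5d} | linear parameters: {n_params} | k-NN stored samples: {n_stored}")
```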
Neural networks with at least one hidden layer and nonlinear activation can approximate any continuous function to arbitrary precision—given enough hidden units. This means neural network hypothesis spaces can be made arbitrarily expressive. However, 'can in principle' and 'can in practice' are very different. Training challenges, finite data, and computational limits mean we rarely achieve this theoretical expressiveness.
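As a rough illustration of this "in principle" versus "in practice" gap, the sketch below (assuming scikit-learn's `MLPRegressor`; exact numbers will vary by version and run) fits one-hidden-layer networks of increasing width to a fixed continuous target. Wider networks define a more expressive hypothesis space, but the optimizer still has to find a good hypothesis inside it:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

np.random.seed(0)
X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * np.pi * X.ravel())  # noise-free target: error reflects capacity + optimization

for width in [2, 8, 32, 128]:
    mlp = MLPRegressor(hidden_layer_sizes=(width,), activation='tanh',
                       solver='lbfgs', max_iter=5000, random_state=0)
    mlp.fit(X, y)
    mse = mean_squared_error(y, mlp.predict(X))
    print(f"hidden units: {width:4d} | training MSE: {mse:.4f}")

# The trend (not the exact values) is the point: more hidden units widen the
# hypothesis space, but training may still stop short of the best hypothesis in it.
```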
Learning can be viewed as a search problem: given training data $\mathcal{D}$, find the hypothesis $h^* \in \mathcal{H}$ that best explains the data and will generalize well.
The Learning Objective:
Define a loss function $L(h, \mathcal{D})$ that measures how poorly hypothesis $h$ fits the data. Learning seeks:
$$h^* = \arg\min_{h \in \mathcal{H}} L(h, \mathcal{D})$$
How the Search Works Depends on $\mathcal{H}$:
Finite Hypothesis Spaces: Can enumerate all possibilities (though often infeasible)
Continuous, Convex Hypothesis Spaces: Gradient descent finds global optimum
Continuous, Non-Convex Hypothesis Spaces: Local optimization, hope for good local minima
Discrete, Combinatorial Spaces: Heuristic search, approximations
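For a small, hypothetical illustration of the finite case, the sketch below enumerates a hypothesis space of 101 decision stumps $h_t(x) = \mathbb{1}[x > t]$ on a 1D problem and simply picks the one with the lowest training error. The continuous case is illustrated by the listing that follows it.

```python
import numpy as np

# A tiny finite hypothesis space: threshold classifiers h_t(x) = 1[x > t]
# with t drawn from a fixed grid. Learning = exhaustive search over H.
np.random.seed(0)
X = np.sort(np.random.uniform(0, 10, 40))
y = (X > 6.3).astype(int)                 # true boundary at x = 6.3
flip = np.random.rand(40) < 0.1           # corrupt 10% of labels with noise
y = np.where(flip, 1 - y, y)

thresholds = np.linspace(0, 10, 101)      # |H| = 101 hypotheses

def training_error(t):
    return np.mean((X > t).astype(int) != y)

errors = [training_error(t) for t in thresholds]
best_t = thresholds[int(np.argmin(errors))]
print(f"Enumerated {len(thresholds)} hypotheses")
print(f"Best threshold: {best_t:.1f} (training error {min(errors):.2f})")
```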
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generate data from a quadratic function with noise
np.random.seed(42)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y_true = 0.5 * X.ravel()**2 + 2          # True function: y = 0.5x² + 2
y = y_true + np.random.randn(30) * 0.5   # Add noise

# Search over hypothesis spaces of different polynomial degrees
print("Searching over hypothesis spaces of increasing complexity")
print("=" * 60)

for degree in [0, 1, 2, 3, 5, 10]:
    # Create polynomial features → defines hypothesis space
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)

    # Search within this hypothesis space (closed-form for linear regression)
    model = LinearRegression()
    model.fit(X_poly, y)

    # The selected hypothesis
    y_pred = model.predict(X_poly)
    mse = mean_squared_error(y, y_pred)

    # Number of parameters defines hypothesis space size
    n_params = X_poly.shape[1]

    print(f"\nDegree {degree} polynomials:")
    print(f"  Hypothesis space: all polynomials up to degree {degree}")
    print(f"  Number of parameters: {n_params}")
    print(f"  Training MSE: {mse:.4f}")
    print(f"  Selected hypothesis: p(x) = {model.intercept_:.3f}", end="")
    for i, coef in enumerate(model.coef_[1:], 1):
        if abs(coef) > 0.001:
            print(f" + {coef:.3f}x^{i}", end="")
    print()

# Key insight: As hypothesis space grows, training error decreases
# But at degree 10, we're likely overfitting to noise
```

For most practical hypothesis spaces, we cannot explore every possibility. Gradient descent, greedy algorithms, and other heuristics explore a path through the space, not the whole space. The hypothesis we find depends on initialization, hyperparameters, and luck. Two runs of neural network training with different random seeds can find very different hypotheses—all in the same hypothesis space.
Here's a fundamental tension in machine learning:
Larger hypothesis spaces → more expressive, can fit more patterns.
But also → more risk of fitting noise, worse generalization.
This isn't just intuitive—it's mathematically formalized in learning theory.
The Generalization Bound (Informal):
With probability at least $1 - \delta$, for any hypothesis $h \in \mathcal{H}$:
$$\text{True Error}(h) \leq \text{Training Error}(h) + \text{Complexity Term}(|\mathcal{H}|, n, \delta)$$
where:
- True Error$(h)$ is the expected error on new data drawn from the underlying distribution
- Training Error$(h)$ is the average error on the $n$ training examples
- $n$ is the number of training examples
- $\delta$ is the allowed failure probability of the guarantee

The complexity term depends on measures like:
- the size $|\mathcal{H}|$ of a finite hypothesis space
- the VC dimension of $\mathcal{H}$
- the Rademacher complexity of $\mathcal{H}$
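For the special case of a finite hypothesis space, one standard form of this bound (Hoeffding's inequality plus a union bound) puts the complexity term at $\sqrt{(\ln|\mathcal{H}| + \ln(1/\delta)) / (2n)}$. The short sketch below simply evaluates that term to show how it scales:

```python
import numpy as np

# Complexity term for a finite hypothesis space (Hoeffding + union bound):
#   True Error(h) <= Training Error(h) + sqrt((ln|H| + ln(1/delta)) / (2n))
def complexity_term(H_size, n, delta=0.05):
    return np.sqrt((np.log(H_size) + np.log(1 / delta)) / (2 * n))

for n in [100, 1_000, 10_000]:
    for H_size in [10, 1_000, 1_000_000]:
        gap = complexity_term(H_size, n)
        print(f"n = {n:6d} | |H| = {H_size:9d} | complexity term = {gap:.3f}")

# The bound loosens only logarithmically in |H| but tightens as 1/sqrt(n):
# a larger hypothesis space demands more data for the same guarantee.
```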
Implications:
Simpler hypothesis spaces generalize better (all else equal)
More training data allows larger hypothesis spaces
Regularization shrinks the effective hypothesis space
The right hypothesis space matches the problem complexity
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate data
np.random.seed(42)
n = 30
X = np.linspace(0, 1, n).reshape(-1, 1)
y_true = np.sin(2 * np.pi * X.ravel())
y = y_true + 0.3 * np.random.randn(n)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try different hypothesis space sizes (polynomial degrees)
degrees = range(0, 15)
train_errors = []
test_errors = []

for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    train_errors.append(mean_squared_error(y_train, model.predict(X_train_poly)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test_poly)))

# Plot the relationship between hypothesis space complexity and error
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'b-o', label='Training Error', markersize=8)
plt.plot(degrees, test_errors, 'r-o', label='Test Error', markersize=8)
plt.axvline(x=3, color='green', linestyle='--', label='Optimal Complexity', alpha=0.7)

plt.fill_between([0, 2], [0]*2, [1]*2, alpha=0.1, color='blue', label='Underfitting Zone')
plt.fill_between([6, 14], [0]*2, [1]*2, alpha=0.1, color='red', label='Overfitting Zone')

plt.xlabel('Polynomial Degree (Hypothesis Space Complexity)')
plt.ylabel('Mean Squared Error')
plt.title('Hypothesis Space Size vs. Generalization Error')
plt.legend()
plt.yscale('log')
plt.ylim(0.001, 10)
plt.grid(True, alpha=0.3)
plt.show()

print("Key observations:")
print("- Training error monotonically decreases with complexity")
print("- Test error decreases, then increases (overfitting)")
print("- Sweet spot: degree 3-4 (matches true sine-like pattern)")
```

Every learning algorithm embodies inductive biases—assumptions about what kinds of hypotheses are preferred. Without these biases, learning would be impossible.
Why Bias is Necessary:
Consider a classification problem with 10 training examples. There are infinitely many hypotheses consistent with these 10 points: a simple linear boundary, a wildly convoluted surface that threads between every point, a lookup table that memorizes the examples and predicts arbitrarily everywhere else, and countless more.
All have zero training error. How do we choose? We need preferences—biases—about which hypotheses are more likely to generalize.
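A quick sketch of the problem (a hypothetical setup, assuming scikit-learn): two very different learners both fit ten training points perfectly, yet disagree on a noticeable fraction of unseen inputs. Training error alone cannot tell them apart.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)
X_train = np.random.uniform(-1, 1, size=(10, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

models = {
    '1-NN (memorizes the points)': KNeighborsClassifier(n_neighbors=1),
    'Unrestricted decision tree':  DecisionTreeClassifier(random_state=0),
}

X_new = np.random.uniform(-1, 1, size=(1000, 2))   # unseen inputs
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: training accuracy = {model.score(X_train, y_train):.2f}")
    predictions[name] = model.predict(X_new)

p1, p2 = predictions.values()
print(f"Disagreement on unseen inputs: {np.mean(p1 != p2):.0%}")
```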
Common Inductive Biases:
- Smoothness: similar inputs should receive similar outputs
- Simplicity (Occam's razor): prefer hypotheses with fewer parameters or shorter descriptions
- Sparsity: only a few features really matter
- Locality and translation invariance: nearby structure matters, and the same pattern can appear anywhere (the bias built into CNNs)
- Large margins: prefer decision boundaries far from the training points (the bias built into SVMs)
Matching Bias to Problem:
The art of machine learning includes choosing algorithms whose inductive biases match the problem structure:
| Problem Domain | Useful Bias | Suggested Algorithms |
|---|---|---|
| Image recognition | Spatial hierarchy, translation invariance | CNNs |
| Tabular data | Feature interactions, nonlinearity | Gradient boosting, random forests |
| Natural language | Sequential dependencies, compositionality | Transformers, RNNs |
| Physical simulations | Smoothness, conservation laws | Physics-informed neural networks |
| Graph-structured data | Permutation invariance, locality | Graph neural networks |
| Time series | Temporal smoothness, periodicity | RNNs, temporal convolutions |
When the inductive bias matches the problem, learning is efficient and generalization is good. When it doesn't, even vast amounts of data may not help.
The No Free Lunch theorem proves that no algorithm is universally better than all others across all possible problems. What makes an algorithm good for one problem makes it worse for another. The goal is matching the algorithm's inductive bias to your specific problem—not finding a 'best' algorithm.
Learning theory distinguishes two settings based on whether the true function is in our hypothesis space.
Realizable Setting:
The true function $f^*$ that generated the data is contained in $\mathcal{H}$:
$$f^* \in \mathcal{H}$$
In this setting:
- Some hypothesis in $\mathcal{H}$ achieves zero true error
- A hypothesis consistent with any training set always exists
- With enough data, learning can drive both training error and true error toward zero
Agnostic Setting:
We make no assumption that $f^* \in \mathcal{H}$:
The best we can hope for is to find the hypothesis $h^*$ that minimizes error among all $h \in \mathcal{H}$:
$$h^* = \arg\min_{h \in \mathcal{H}} \text{True Error}(h)$$
In this setting:
- Zero error may be unachievable no matter how much data we collect
- There is an irreducible approximation error determined by how well $\mathcal{H}$ can approximate $f^*$
- The goal is to get close to the best hypothesis in $\mathcal{H}$, not to the truth itself
Decomposing Error:
In the agnostic setting, the error of a learned hypothesis $\hat{h}$ can be decomposed:
$$\text{Error}(\hat{h}) = \underbrace{\text{Error}(h^*)}_{\text{Approximation Error}} + \underbrace{\left(\text{Error}(\hat{h}) - \text{Error}(h^*)\right)}_{\text{Estimation Error}}$$
where:
- Approximation error is the error of $h^*$, the best hypothesis in $\mathcal{H}$. It is determined by the choice of hypothesis space, not by the data.
- Estimation error is how far the learned hypothesis $\hat{h}$ falls short of $h^*$. It is caused by learning from a finite, noisy sample.
This is the essence of the bias-variance tradeoff viewed through hypothesis spaces: enlarging $\mathcal{H}$ reduces approximation error (bias) but tends to increase estimation error (variance), while shrinking $\mathcal{H}$ does the reverse.
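The sketch below makes the decomposition tangible under an assumed setup: the true function $\sin(2\pi x)$ lies outside the linear hypothesis space, $h^*$ is approximated by fitting on a huge noise-free sample, and $\hat{h}$ is fit on 20 noisy points. The exact numbers depend on the random draw; the split into two error sources is the point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # true function: NOT in the linear hypothesis space

# Approximate h* (the best linear hypothesis) by fitting on a huge noise-free sample
X_big = rng.uniform(0, 1, size=(100_000, 1))
h_star = LinearRegression().fit(X_big, f(X_big.ravel()))

# The learned hypothesis: fit on a small, noisy sample
X_small = rng.uniform(0, 1, size=(20, 1))
y_small = f(X_small.ravel()) + 0.3 * rng.standard_normal(20)
h_hat = LinearRegression().fit(X_small, y_small)

# Use a dense grid as a stand-in for the true error
X_eval = np.linspace(0, 1, 2000).reshape(-1, 1)
approx_error = mean_squared_error(f(X_eval.ravel()), h_star.predict(X_eval))
total_error = mean_squared_error(f(X_eval.ravel()), h_hat.predict(X_eval))

print(f"Approximation error (best linear hypothesis): {approx_error:.3f}")
print(f"Total error of the learned hypothesis:        {total_error:.3f}")
print(f"Estimation error (the difference):            {total_error - approx_error:.3f}")
```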
In practice, we almost never know the true function, so the agnostic setting is more realistic. We choose hypothesis spaces hoping they're expressive enough to capture the essential patterns while remaining constrained enough to enable generalization. It's always a guess—informed by domain knowledge and validated empirically.
Given the tradeoff between expressiveness and generalization, how do we effectively restrict the hypothesis space? Several mechanisms work together:
1. Architecture Choice:
The structure of the model defines $\mathcal{H}$: the degree of a polynomial, the depth of a decision tree, the choice of kernel, and the number of layers and units in a neural network all determine which functions can be represented at all.
2. Regularization:
Penalizing certain hypotheses makes them effectively unreachable: an L2 penalty (ridge) discourages large weights, an L1 penalty (lasso) drives many weights to exactly zero, and early stopping halts the search before it wanders into the most complex regions of the space.
3. Hyperparameter Settings:
Choices such as maximum tree depth, the number of neighbors $k$, kernel width, or the regularization strength $\alpha$ control how much of the hypothesis space the learner can effectively use, as the example below shows.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Generate noisy data
np.random.seed(42)
n = 30
X = np.linspace(0, 1, n).reshape(-1, 1)
y_true = np.sin(4 * np.pi * X.ravel())
y = y_true + 0.5 * np.random.randn(n)

# High-degree polynomial = large hypothesis space
# Without regularization: overfits
# With regularization: restricted to simpler functions

degree = 15
alphas = [0, 0.0001, 0.001, 0.1, 10]

fig, axes = plt.subplots(1, len(alphas), figsize=(20, 4))
X_plot = np.linspace(0, 1, 100).reshape(-1, 1)

for ax, alpha in zip(axes, alphas):
    # Create pipeline: polynomial features + ridge regression
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('ridge', Ridge(alpha=alpha))
    ])
    pipeline.fit(X, y)
    y_pred = pipeline.predict(X_plot)

    # Get coefficient magnitudes
    coef_norm = np.linalg.norm(pipeline.named_steps['ridge'].coef_)

    ax.scatter(X, y, color='blue', s=50, label='Data', alpha=0.6)
    ax.plot(X_plot, y_pred, 'r-', linewidth=2, label='Model')
    ax.plot(X_plot, np.sin(4 * np.pi * X_plot.ravel()), 'g--', linewidth=1, label='True', alpha=0.7)
    ax.set_title(f'α = {alpha}\n||w|| = {coef_norm:.1f}')
    ax.set_ylim(-3, 3)
    ax.legend(fontsize=8)

plt.suptitle('Regularization Restricts the Effective Hypothesis Space', fontsize=14)
plt.tight_layout()
plt.show()

# Key insight: Same hypothesis space (degree-15 polynomials)
# But regularization makes most of that space "expensive" to reach
# Effectively restricting the learner to simpler functions
```

Regularization doesn't change the hypothesis space—it changes the search process. With regularization, hypotheses with large parameters become very 'expensive' to reach, so the optimizer settles for simpler hypotheses that trade a little training fit for much simpler representations.
We've established the theoretical framework for understanding what machine learning algorithms actually search over and how this shapes what they can learn.
What's Next:
We've established what models search over and how to evaluate them. But how do we actually measure 'fit'? The next page introduces loss functions—the mathematical quantification of model error that drives learning.
You now understand the hypothesis space—the set of all possible functions a learning algorithm considers. You can reason about how algorithm choice defines the hypothesis space, why restricting it aids generalization, and how inductive bias shapes learning. Next, we'll explore loss functions.