The Universal Approximation Theorem tells us that neural networks can approximate any continuous function, but it says nothing about how efficiently. This is the gap that approximation bounds fill: they give quantitative answers to the fundamental question of how accuracy scales with network size.
These bounds are not merely academic curiosities. They guide practical decisions: how large should your network be? Will you hit the curse of dimensionality? What accuracy can you expect given your parameter budget? Understanding approximation theory transforms architecture design from guesswork into principled engineering.
By the end of this page, you will understand Barron's celebrated approximation bounds, grasp the conditions that determine whether a problem suffers from the curse of dimensionality, appreciate how smoothness and structural assumptions affect approximation rates, and be able to apply this knowledge to estimate network size requirements.
Defining Approximation Quality:
Given a target function $f$ and an approximating network $G_N$ with $N$ parameters, the approximation error is typically measured as:
$$\text{Error} = |f - G_N|$$
where the norm $|\cdot|$ is typically the sup norm $|\cdot|_\infty$ (worst-case error over the domain) or an $L^p$ norm such as the $L^2$ norm (average-case error).
The Central Question:
How does optimal approximation error scale with network size $N$?
$$\epsilon^*(N) = \inf_{G \text{ with } \leq N \text{ params}} |f - G|$$
We seek bounds of the form $\epsilon^*(N) = O(N^{-\alpha})$ for some exponent $\alpha > 0$. The exponent $\alpha$ is the approximation rate—larger is better.
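When such measurements are available in practice, the rate $\alpha$ can be estimated by fitting a line in log-log space. A minimal sketch (the error values below are hypothetical placeholders, not measurements from any real experiment):

```python
import numpy as np

# Hypothetical measurements: parameter counts N and the best achieved errors.
# In practice these would come from training networks of increasing size.
N_values = np.array([100, 200, 400, 800, 1600])
errors = np.array([0.10, 0.071, 0.050, 0.036, 0.025])

# If error ~ c * N^(-alpha), then log(error) = log(c) - alpha * log(N),
# so a least-squares line in log-log space estimates the rate alpha.
slope, intercept = np.polyfit(np.log(N_values), np.log(errors), 1)
alpha = -slope
print(f"estimated approximation rate alpha ≈ {alpha:.2f}")  # ≈ 0.5 for these numbers
```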
Classical Lower Bounds:
Approximation theory establishes fundamental limits. For smooth functions on the unit cube $[0,1]^n$:
Theorem (Classical Approximation Theory):
Let $f$ be $s$-times continuously differentiable. Then for any approximation method using $N$ free parameters: $$|f - G_N|_\infty \geq C \cdot N^{-s/n}$$
The exponent $s/n$ appears frequently—$s$ measures smoothness; $n$ measures dimensionality. This ratio determines approximation difficulty.
The Curse of Dimensionality Quantified:
For fixed smoothness $s$, doubling the dimension $n$ squares the parameter count needed for fixed error: $$N(\epsilon, n) \propto \epsilon^{-n/s}$$
This exponential dependence on $n$ is the curse of dimensionality. Methods achieving this classical rate are called "dimension-affected."
For arbitrary smooth functions, no method escapes the curse. A function on $\mathbb{R}^{100}$ with two continuous derivatives requires $O(\epsilon^{-50})$ parameters for accuracy $\epsilon$—astronomically large for any reasonable $\epsilon$. This is why raw high-dimensional problems are intractable without structural assumptions.
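A quick back-of-the-envelope script (a sketch that simply evaluates the asymptotic formula, ignoring constants) makes the explosion concrete:

```python
# Parameters needed for error eps under the classical rate N ~ eps^(-n/s),
# ignoring constants. Illustrative only.
def classical_param_count(eps: float, n: int, s: int) -> float:
    return eps ** (-n / s)

eps = 0.1
s = 2  # two continuous derivatives
for n in (2, 10, 100):
    print(f"n={n:3d}, s={s}: N ~ {classical_param_count(eps, n, s):.1e}")
# n=  2: N ~ 1.0e+01
# n= 10: N ~ 1.0e+05
# n=100: N ~ 1.0e+50
```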
| Function Class | Smoothness | Approximation Rate | Parameters for ε error |
|---|---|---|---|
| Continuous | None | No rate guarantee | May be arbitrarily large |
| Lipschitz | Bounded gradient | $O(N^{-1/n})$ | $O(\epsilon^{-n})$ |
| $C^s$ (s derivatives) | s derivatives | $O(N^{-s/n})$ | $O(\epsilon^{-n/s})$ |
| Analytic | Infinite derivatives | $O(e^{-cN^{1/n}})$ | $O((\log 1/\epsilon)^n)$ |
| Barron class | Finite spectral norm | $O(N^{-1/2})$ | $O(\epsilon^{-2})$ |
Andrew Barron's 1993 theorem is one of the most celebrated results in neural network approximation theory. It identifies a class of functions for which neural networks achieve dimension-independent approximation rates—completely avoiding the curse of dimensionality.
The Barron Class:
A function $f: \mathbb{R}^n \to \mathbb{R}$ belongs to the Barron class if there exists a signed measure $\mu$ over neuron parameters $(v, w, b) \in \mathbb{R} \times \mathbb{R}^n \times \mathbb{R}$ such that:
$$f(x) = \int v \, \sigma(w^T x + b) \, d\mu(v, w, b)$$
and the spectral norm is finite: $$C_f = \int |v| \cdot (|w|_1 + |b|) \, d|\mu|(v, w, b) < \infty$$
Informally, $f$ is in the Barron class if it can be expressed as an "integral" of sigmoidal neurons with finite total variation. The constant $C_f$ measures how "spread out" this representation is.
For functions with integrable Fourier transforms, the Barron norm is related to the first moment of the Fourier transform: $C_f \propto \int |\omega| |\hat{f}(\omega)| d\omega$. Functions with rapidly decaying Fourier transforms (smooth but not too oscillatory) have small Barron norms.
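As a sanity check on this Fourier characterization, here is a small numerical sketch. It assumes the 1-D Gaussian $f(x) = e^{-x^2/2}$ and the convention $\hat{f}(\omega) = \int f(x)e^{-i\omega x}\,dx$, under which $\hat{f}(\omega) = \sqrt{2\pi}\,e^{-\omega^2/2}$:

```python
import numpy as np

# Fourier first moment  ∫ |ω| |f̂(ω)| dω  for f(x) = exp(-x²/2),
# whose transform is f̂(ω) = sqrt(2π) exp(-ω²/2) (convention-dependent).
omega = np.linspace(-20.0, 20.0, 40001)
d_omega = omega[1] - omega[0]
f_hat = np.sqrt(2 * np.pi) * np.exp(-omega**2 / 2)

numeric = np.sum(np.abs(omega) * np.abs(f_hat)) * d_omega  # Riemann sum
exact = 2 * np.sqrt(2 * np.pi)                             # closed form ≈ 5.01
print(numeric, exact)
```

Smooth, slowly varying targets like this keep the first moment small, hence a small Barron norm; highly oscillatory targets push spectral mass to large $|\omega|$ and the norm blows up.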
Theorem (Barron, 1993):
Let $f$ be in the Barron class with spectral norm $C_f$, and let the domain be the unit ball $B_n$ in $\mathbb{R}^n$. For any $N \geq 1$, there exists a single-hidden-layer neural network $G_N$ with $N$ hidden units such that:
$$|f - G_N|_{L^2(B_n)}^2 \leq \frac{C_f^2}{N}$$
The approximation error in $L^2$ norm is $O(1/\sqrt{N})$, independent of dimension $n$.
The Significance:
This is striking. For Barron class functions, reaching error $\epsilon$ requires only $N = O(C_f^2/\epsilon^2)$ hidden units—no dependence on the input dimension $n$ at all.
Contrast this with classical approximation, where $N = O(\epsilon^{-n/s})$ is exponential in dimension.
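In code, the Barron budget is a one-liner, and it is visibly independent of $n$ (a sketch; $C_f = 10$ is an arbitrary illustrative value):

```python
# Barron-class budget for target error eps: N ~ C_f^2 / eps^2, with no
# dependence on the input dimension n. C_f = 10 is an illustrative value.
def barron_budget(eps: float, C_f: float = 10.0) -> float:
    return (C_f / eps) ** 2

for eps in (0.1, 0.01, 0.001):
    print(f"eps={eps}: N ~ {barron_budget(eps):.1e}")
# eps=0.1: 1.0e+04, eps=0.01: 1.0e+06, eps=0.001: 1.0e+08 — the same whether n is 10 or 10^6.
```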
Proof Idea:
Barron's proof uses a probabilistic argument:
1. Represent $f$ as an expectation: since $f(x) = \int v \, \sigma(w^T x + b) \, d\mu(v, w, b)$, we can write $f(x) = \mathbb{E}_{(v, w, b)}[v \, \sigma(w^T x + b)]$ under a suitably normalized version of $\mu$.
2. Monte Carlo approximation: approximate this expectation by sampling $N$ neurons: $$G_N(x) = \frac{1}{N} \sum_{i=1}^{N} v_i \sigma(w_i^T x + b_i)$$ where $(v_i, w_i, b_i)$ are sampled from the normalized measure.
3. Variance bound: the variance of an average of $N$ independent samples scales as $1/N$, so the expected squared error between $G_N$ and $f$ is $O(\text{Var}/N) = O(C_f^2/N)$.
4. From expectation to existence: because the expected squared error of a random network is at most of order $C_f^2/N$, at least one realization achieves it—so a good $N$-neuron network exists, and optimization can only do better.
The key insight is that neural networks are computing a kind of Monte Carlo integral—sampling from a function space rather than a probability space.
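This sampling picture is easy to verify numerically. The sketch below builds a toy 1-D target as the average of a large finite population of random sigmoidal neurons (standing in for the measure $\mu$; all distributions and sizes here are arbitrary illustrative choices) and checks that the squared error of an $N$-neuron subsample decays like $1/N$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# A 1-D "Barron-like" target: the average of a large population of random
# sigmoidal neurons (this finite population stands in for the measure mu).
M = 20_000
w_pop, b_pop = rng.normal(size=M), rng.normal(size=M)

x = np.linspace(-3, 3, 200)
features = lambda w, b: sigmoid(np.outer(x, w) + b)   # shape (len(x), #neurons)
f_target = features(w_pop, b_pop).mean(axis=1)

# Monte Carlo approximation with N neurons sampled (with replacement) from the
# population; the mean squared error should decay roughly like 1/N.
for N in (10, 100, 1000, 10_000):
    idx = rng.integers(0, M, size=N)
    G_N = features(w_pop[idx], b_pop[idx]).mean(axis=1)
    mse = np.mean((f_target - G_N) ** 2)
    print(f"N={N:6d}   MSE={mse:.2e}   N*MSE={mse * N:.4f}")
# N*MSE stays roughly constant, matching the O(C_f^2 / N) bound.
```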
Classical methods place basis functions on a fixed grid, requiring exponentially many grid points in high dimensions. Neural networks learn WHERE to place basis functions (neurons), adapting to the function's structure. For Barron class functions, the learning can concentrate neurons where they matter most, avoiding the grid explosion.
Barron's theorem applies to single-hidden-layer networks. Recent work extends approximation theory to deep architectures, revealing significant advantages from depth.
Depth Hierarchy Results:
Theorem (Yarotsky, 2017):
For any $\epsilon > 0$ and $s > 0$, there exists a ReLU network with $O(\epsilon^{-n/s} \log(1/\epsilon))$ parameters that approximates any $s$-times differentiable function on $[0,1]^n$ within $L^\infty$ error $\epsilon$.
The depth is $O(\log(1/\epsilon))$, and the width is $O(\epsilon^{-n/s})$.
This matches classical lower bounds up to logarithmic factors, showing deep ReLU networks are near-optimal for smooth functions.
Composition Structure:
Deep networks excel at approximating compositional functions—functions built from nested compositions: $$f(x) = f_L(f_{L-1}(\cdots f_1(x) \cdots))$$
Theorem (Poggio et al., 2017):
If $f$ is a compositional function of depth $L$ where each component $f_i: \mathbb{R}^{k_i} \to \mathbb{R}^{k_{i+1}}$ is smooth, then a deep network of depth $O(L)$ can approximate $f$ with: $$N = O\left(L \cdot \epsilon^{-\max_i k_i / s}\right) \text{ parameters}$$
A shallow network would require $O\left(\epsilon^{-n/s}\right)$ parameters where $n = k_1$ is the input dimension.
If $\max_i k_i \ll n$ (each layer processes few variables at once), depth provides exponential savings.
Real-world functions often have compositional structure: an image classifier computes edges, then textures, then parts, then objects—a deep composition of relatively simple functions. Deep networks match this structure, achieving efficient approximation. This is why depth works well in practice, not just theory.
ReLU-Specific Results:
ReLU networks have been studied intensively due to their practical importance:
Theorem (Yarotsky, 2017):
The function $x \mapsto x^2$ on $[0,1]$ can be approximated to within $\epsilon$ by a ReLU network of depth $O(\log(1/\epsilon))$ and width $O(1)$.
This is remarkable—a fixed-width network, with depth logarithmic in precision, approximates a polynomial. A shallow (one-hidden-layer) ReLU network, being piecewise linear with one break per neuron, would need width on the order of $1/\sqrt{\epsilon}$ for the same task.
Approximation of Polynomials:
ReLU networks can approximate the monomial $x^2$, and from squares all products and polynomials (via $xy = \tfrac{1}{2}\left((x+y)^2 - x^2 - y^2\right)$). The construction uses the hat function $g(t) = 2\,\mathrm{ReLU}(t) - 4\,\mathrm{ReLU}(t - \tfrac{1}{2}) + 2\,\mathrm{ReLU}(t - 1)$ and the identity $$x^2 = x - \sum_{k=1}^{\infty} \frac{g^{\circ k}(x)}{4^k}, \qquad x \in [0,1],$$ where $g^{\circ k}$ denotes the $k$-fold composition of $g$. Each composition doubles the number of linear pieces ("folds"), so truncating the sum after $m$ terms gives error $O(4^{-m})$ with a network of depth $O(m)$—exponential accuracy in depth.
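The construction can be checked numerically; here is a minimal sketch that evaluates the truncated sum directly (compositions written in NumPy rather than as an explicit layered network):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

# Hat function g: [0,1] -> [0,1], expressible with three ReLU units.
def g(t):
    return 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

# Depth-m approximation of x^2 on [0,1]:  x - sum_{k=1}^m g∘...∘g(x) / 4^k.
def square_approx(x, m):
    out = x.copy()
    folded = x.copy()
    for k in range(1, m + 1):
        folded = g(folded)          # each extra layer doubles the "folds"
        out -= folded / 4**k
    return out

x = np.linspace(0, 1, 1001)
for m in (1, 2, 4, 8):
    err = np.max(np.abs(square_approx(x, m) - x**2))
    print(f"depth m={m}:  sup error = {err:.2e}   (theory: 4^-(m+1) = {4.0 ** -(m + 1):.2e})")
```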
| Architecture | Target Class | Error Rate | Notes |
|---|---|---|---|
| Shallow, Barron class | Integral representation | $O(N^{-1/2})$ | Dimension-independent |
| Shallow, Sobolev $W^{s,p}$ | $s$-smooth functions | $O(N^{-s/n})$ | Curse of dimensionality |
| Deep ReLU, Sobolev | $s$-smooth functions | $O(N^{-s/n})$ | Matches classical, log factor improvement |
| Deep ReLU, compositional | Depth-$L$ composition | $O(L \cdot N^{-s/d_{\max}})$ | Exponential savings possible |
| Deep ReLU, polynomials | Degree-$k$ polynomial | $O(2^{-N})$ | Exponential convergence |
Upper bounds tell us what's achievable; lower bounds tell us what's necessary. Together, they establish optimality.
Fundamental Lower Bound:
Theorem: For any approximation method with $N$ free parameters, there exists a function $f \in W^{s,\infty}([0,1]^n)$ (the Sobolev space of $s$-smooth functions) such that:
$$|f - G_N|_\infty \geq c \cdot N^{-s/n}$$
where $c > 0$ is a constant depending only on $s$ and $n$.
Information-Theoretic Interpretation:
This lower bound is information-theoretic—it limits any method, not just neural networks. The bound reflects that smooth functions on $[0,1]^n$ form an "essentially $O(\epsilon^{-n/s})$-dimensional" set in some sense; fewer than this many parameters cannot distinguish all functions at resolution $\epsilon$.
Neural Network-Specific Lower Bounds:
Theorem (Yarotsky, 2017):
For ReLU networks approximating $s$-smooth functions on $[0,1]^n$, any network achieving error $\epsilon$ requires a parameter count that grows polynomially in $1/\epsilon$, with exponent proportional to $n/s$—so the upper bound above is matched up to logarithmic factors.
Width-Depth Tradeoffs Quantified:
Together, the upper and lower bounds pin down what width and depth can buy: depth $O(\log(1/\epsilon))$ with width $O(\epsilon^{-n/s})$ suffices, and no rearrangement of width and depth can push the total parameter count below the $\epsilon^{-n/s}$ scale for this function class.
Lower bounds prove that SOME functions require many parameters. They don't claim ALL functions do. In practice, real-world functions may have structure (like compositionality) that makes them easier than the worst case. Lower bounds are pessimistic but important for understanding fundamental limits.
Optimality of Known Constructions:
Comparing upper and lower bounds establishes optimality:
| Function Class | Upper Bound | Lower Bound | Gap |
|---|---|---|---|
| Sobolev $W^{s,\infty}$ | $O(N^{-s/n})$ | $\Omega(N^{-s/n})$ | None (optimal) |
| Barron class | $O(N^{-1/2})$ | $\Omega(N^{-1/2})$ | None (optimal) |
| Analytic functions | $O(e^{-cN^{1/n}})$ | $\Omega(e^{-c'N^{1/n}})$ | Constant in exponent |
| Lipschitz | $O(N^{-1/n})$ | $\Omega(N^{-1/n})$ | None (optimal) |
For most classical function spaces, neural network approximation rates are optimal—we can't do fundamentally better with any method.
The Barron Lower Bound:
Theorem (Jones, 1992):
There exist functions $f$ in the Barron class with norm $C_f$ for which any approximation $G_N$ with $N$ neurons must satisfy:
$$\mathbb{E}[|f - G_N|^2] \geq \frac{c \cdot C_f^2}{N}$$
This shows Barron's upper bound is essentially tight—$O(1/\sqrt{N})$ is the best rate achievable uniformly over this function class.
Approximation rates depend critically on smoothness assumptions. Different notions of smoothness lead to different rates and enable different approximation strategies.
The Sobolev Scale:
Sobolev spaces $W^{s,p}(\Omega)$ measure smoothness via integrability of derivatives:
$$|f|_{W^{s,p}} = \left( \sum_{|\alpha| \leq s} |D^\alpha f|_p^p \right)^{1/p}$$
Higher $s$ = more derivatives = smoother function = faster approximation.
Key Sobolev Results:
Theorem: For $f \in W^{s,p}([0,1]^n)$ with $1 \leq p \leq \infty$: $$\inf_{G_N} |f - G_N|_p \leq C \cdot N^{-s/n} \cdot |f|_{W^{s,p}}$$
The rate $s/n$ captures the smoothness-dimension tradeoff.
Besov Spaces:
Besov spaces $B^s_{p,q}$ generalize Sobolev spaces with a finer smoothness scale, indexed by the order $s$, the integrability exponent $p$, and a secondary fine-tuning parameter $q$.
Besov spaces are natural for adaptive approximation (like neural networks) because they capture inhomogeneous smoothness—functions that are smooth in some regions and rough in others.
Theorem (DeVore et al., 1989):
Adaptive methods (like neural networks) can exploit inhomogeneous smoothness, achieving rates depending on local rather than global regularity.
This is why neural networks often outperform classical methods: they automatically concentrate capacity where the function is complex.
A function might be smooth everywhere except near a sharp edge. Classical methods using fixed grids must refine everywhere to capture the edge. Neural networks can place neurons precisely at the edge, achieving the same accuracy with far fewer parameters. This adaptivity is key to their success on real data.
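The sketch below makes this concrete with piecewise-linear interpolation as a stand-in for a small ReLU network (the target, the knot budget, and the "adaptive" placement rule are all illustrative choices):

```python
import numpy as np

# Target: smooth everywhere except a steep transition near x = 0.3.
f = lambda x: np.tanh(50 * (x - 0.3))
x_fine = np.linspace(0.0, 1.0, 10_001)

def sup_error(knots):
    """Sup-norm error of piecewise-linear interpolation through the given knots."""
    approx = np.interp(x_fine, knots, f(knots))
    return np.max(np.abs(f(x_fine) - approx))

budget = 20
uniform_knots = np.linspace(0.0, 1.0, budget)
# "Adaptive" placement: spend half the budget inside the steep region.
adaptive_knots = np.unique(np.concatenate([
    np.linspace(0.0, 1.0, budget // 2),
    np.linspace(0.25, 0.35, budget // 2),
]))

print("uniform grid  error:", sup_error(uniform_knots))
print("adaptive knots error:", sup_error(adaptive_knots))
# The adaptive placement achieves a much smaller error with the same budget.
```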
Reproducing Kernel Hilbert Spaces (RKHS):
Neural networks can also be analyzed through RKHS theory. Define the neural tangent kernel (NTK):
$$K(x, x') = \mathbb{E}_{\theta \sim \text{init}} \left[ \langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x') \rangle \right]$$
The RKHS associated with this kernel characterizes functions well-approximated by neural networks (in the lazy training regime).
Theorem (Arora et al., 2019):
In the infinite-width limit, gradient descent on neural networks is equivalent to kernel regression with the NTK. The approximation rate for the RKHS is: $$|f - G_N|^2 \leq \frac{|f|_\mathcal{H}^2}{N}$$
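To make the NTK definition concrete, here is a minimal sketch that estimates $K(x, x')$ for a toy one-hidden-layer ReLU network by averaging gradient inner products over random initializations (the architecture, width, and initialization scalings are illustrative assumptions, not the exact setup of Arora et al.):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU net: f_theta(x) = sum_j a_j * relu(w_j . x + b_j).
n_in, width = 5, 256

def grad_f(x, W, b, a):
    """Gradient of f_theta(x) with respect to (W, b, a), flattened."""
    pre = W @ x + b                        # pre-activations, shape (width,)
    act = np.maximum(pre, 0.0)
    ind = (pre > 0).astype(float)          # ReLU derivative
    dW = (a * ind)[:, None] * x[None, :]   # df/dW_jk = a_j * 1[pre_j > 0] * x_k
    db = a * ind
    da = act
    return np.concatenate([dW.ravel(), db, da])

def empirical_ntk(x1, x2, n_init=200):
    """Average gradient inner product over random initializations."""
    vals = []
    for _ in range(n_init):
        W = rng.normal(size=(width, n_in)) / np.sqrt(n_in)
        b = rng.normal(size=width)
        a = rng.normal(size=width) / np.sqrt(width)
        vals.append(grad_f(x1, W, b, a) @ grad_f(x2, W, b, a))
    return np.mean(vals)

x1, x2 = rng.normal(size=n_in), rng.normal(size=n_in)
print("K(x1, x2) ≈", empirical_ntk(x1, x2))
```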
Table: Smoothness Spaces and Their Approximation Rates
| Space | Smoothness Notion | Rate | Comment |
|---|---|---|---|
| $C^s$ | Continuous derivatives | $O(N^{-s/n})$ | Classical, dimension-affected |
| $W^{s,p}$ | Sobolev (integrable derivatives) | $O(N^{-s/n})$ | Allows weaker pointwise regularity |
| $B^s_{p,q}$ | Besov (localized smoothness) | Rate depends on local regularity | Good for adaptive methods |
| Barron | Spectral decay | $O(N^{-1/2})$ | Dimension-independent |
| RKHS | Kernel-induced | $O(N^{-1/2})$ or better | Depends on kernel eigenvalue decay |
Input dimension is the critical factor determining whether neural network approximation is tractable. We've seen both horror stories (curse of dimensionality) and success stories (Barron class). Understanding when each applies is essential.
When Dimension Hurts:
The curse strikes when the target function genuinely depends on all $n$ input coordinates, only generic smoothness (a bounded number of derivatives) can be assumed, and no additional structure is available.
For such functions, $N = \Omega(\epsilon^{-n/s})$ parameters are necessary—exponential in $n$.
When Dimension Doesn't Hurt:
Several structural conditions avoid the curse:
1. Low Intrinsic Dimension: If the target function effectively depends on only $d \ll n$ coordinates (or $d$ linear combinations), rates depend on $d$, not $n$.
Theorem: If $f(x) = g(Ax)$ where $A \in \mathbb{R}^{d \times n}$ and $g: \mathbb{R}^d \to \mathbb{R}$, then approximation with Barron-class networks depends on $d$, not $n$.
2. Compositional Structure: As discussed, compositional functions $f = f_L \circ \cdots \circ f_1$ where each $f_i$ has limited input dimension avoid the curse.
3. Additive Structure: Functions of the form $f(x) = \sum_{i=1}^n g_i(x_i)$ (additive models) decompose into 1D problems with no curse.
4. Multiplicative/Interaction Limits: Functions with bounded interaction order (variables never appear in groups larger than $k$) have complexity exponential in $k$, not $n$.
The manifold hypothesis posits that real-world high-dimensional data (like images) lies on or near low-dimensional manifolds. If true, the effective dimension is the manifold dimension, not the ambient dimension. This explains why deep learning works on 1000×1000 images (10^6 pixels)—the manifold of natural images is far lower-dimensional.
The Effective Dimension:
Cohen et al. (2016) introduced methods to measure effective dimension in practice:
$$d_{\text{eff}} = \frac{(\sum_i \lambda_i)^2}{\sum_i \lambda_i^2}$$
where $\lambda_i$ are eigenvalues of the data covariance (or related quantities). If $d_{\text{eff}} \ll n$, the curse is mitigated.
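A minimal sketch of this estimate on synthetic data (the latent dimension of 5, the ambient dimension of 100, and the noise level are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 5 latent factors linearly embedded in 100 ambient dimensions,
# plus a little isotropic noise.
n_samples, ambient_dim, latent_dim = 2000, 100, 5
latent = rng.normal(size=(n_samples, latent_dim))
embedding = rng.normal(size=(latent_dim, ambient_dim))
X = latent @ embedding + 0.05 * rng.normal(size=(n_samples, ambient_dim))

# Effective dimension from covariance eigenvalues: (sum λ)^2 / sum λ^2.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
d_eff = eigvals.sum() ** 2 / (eigvals ** 2).sum()
print(f"ambient dimension = {ambient_dim},  effective dimension ≈ {d_eff:.1f}")
```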
Practical Implication:
When designing networks for high-dimensional inputs, ask: Does the target effectively depend on only a few coordinates or linear combinations of them? Is it a composition of simpler, lower-dimensional functions? Is it approximately additive? Are variable interactions confined to small groups?
If none of these hold and $n$ is large, expect to need very large networks or accept higher error.
Approximation bounds provide theoretical guidance for network sizing, but translating theory to practice requires care.
Rule-of-Thumb Network Sizing:
For a function of complexity $C$ (measured in some appropriate norm) requiring error $\epsilon$:
Shallow networks (1 hidden layer): $$N \approx C^2 / \epsilon^2 \quad \text{(Barron class)}$$ $$N \approx C \cdot \epsilon^{-n/s} \quad \text{(Sobolev class)}$$
Deep networks (L layers): $$N \approx C \cdot L \cdot \epsilon^{-d_{\text{eff}}/s} \quad \text{(compositional)}$$
where $d_{\text{eff}}$ is the effective dimension per layer.
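These rules of thumb are easy to wrap in a small calculator (a sketch; constants are absorbed into $C$, the outputs should be read as orders of magnitude, and the specific inputs below are chosen arbitrarily):

```python
def shallow_barron(C: float, eps: float) -> float:
    """Barron-class rule of thumb: N ~ C^2 / eps^2 (dimension-free)."""
    return C**2 / eps**2

def shallow_sobolev(C: float, eps: float, n: int, s: int) -> float:
    """Sobolev-class rule of thumb: N ~ C * eps^(-n/s) (curse of dimensionality)."""
    return C * eps ** (-n / s)

def deep_compositional(C: float, eps: float, L: int, d_eff: int, s: int) -> float:
    """Compositional rule of thumb: N ~ C * L * eps^(-d_eff/s)."""
    return C * L * eps ** (-d_eff / s)

C, eps = 10.0, 0.01
print(f"shallow / Barron     : N ~ {shallow_barron(C, eps):.1e}")
print(f"shallow / Sobolev    : N ~ {shallow_sobolev(C, eps, n=32, s=2):.1e}")
print(f"deep / compositional : N ~ {deep_compositional(C, eps, L=8, d_eff=4, s=2):.1e}")
```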
Calibrating Constants:
Theoretical bounds hide constants that matter in practice. Treat them as scaling guidance—how requirements grow as $\epsilon$ shrinks or $n$ grows—rather than as exact size prescriptions, and calibrate the constants empirically on scaled-down versions of the problem.
The Approximation-Generalization Tradeoff:
Approximation error: How well the best network in your class fits the target.
Generalization error: How well your trained network performs on unseen data.
Total error = Approximation error + Generalization error
Bigger networks reduce approximation error but may increase generalization error (overfitting). The optimal size balances these.
Classical wisdom said to match network size to target complexity. Modern deep learning uses overparameterized networks (N >> data points) with implicit or explicit regularization. Theory is still catching up to explain why this works (double descent, lottery tickets, etc.). Approximation bounds remain relevant but don't tell the whole story.
Sample Complexity Considerations:
Approximation bounds assume we can perfectly optimize the objective. In practice, we learn from finite samples, introducing statistical error:
$$\text{Total Error} \leq \underbrace{\text{Approx Error}(N)}_{\text{function class}} + \underbrace{\text{Stat Error}(N, n_{\text{samples}})}_{\text{estimation}}$$
Statistical error typically scales as $\sqrt{N / n_{\text{samples}}}$—more parameters need more data.
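Combining the two terms, a quick sketch of the tradeoff (assuming a Barron-style approximation term $C/\sqrt{N}$ and the statistical term above, with invented constants):

```python
import numpy as np

C, n_samples = 10.0, 100_000

def total_error(N):
    approx = C / np.sqrt(N)          # approximation error shrinks with size
    stat = np.sqrt(N / n_samples)    # estimation error grows with size
    return approx + stat

N_grid = np.logspace(1, 5, 200)
best = N_grid[np.argmin(total_error(N_grid))]
print(f"error-minimizing size: N ≈ {best:.0f}")
# Balancing C/sqrt(N) = sqrt(N/n_samples) gives N = C * sqrt(n_samples) ≈ 3162 here.
```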
Minimum Sample Requirements:
The interplay between approximation and sample complexity determines practical network sizing.
We've navigated the quantitative landscape of neural network approximation theory. The essential insights: generic smoothness buys only the rate $s/n$, so high-dimensional problems without structure suffer the curse of dimensionality; Barron-class functions escape it with a dimension-independent $O(N^{-1/2})$ rate; depth pays off when the target is compositional; matching lower bounds show these rates are essentially optimal; and practical sizing must balance approximation error against estimation error from finite data.
What's Next:
We've characterized the approximation power of neural networks quantitatively. The next lesson examines Practical Implications—how these theoretical insights translate to real-world architecture design, hyperparameter selection, and troubleshooting. We'll bridge from theorems to engineering practice.
You now command the quantitative theory of neural network approximation—from Barron's celebrated theorem to modern deep network bounds. This knowledge enables you to reason about network sizing, understand the conditions that enable or prevent efficient approximation, and appreciate both the power and limitations of neural network function approximation.