At the heart of every generative model lies a seemingly simple goal: estimate the probability density function of data. Given samples $x_1, x_2, \ldots, x_n$ drawn from some unknown distribution $P^*(x)$, we want to learn an approximation $P_\theta(x)$ that captures the true data-generating process.
This is density estimation—arguably the most fundamental problem in statistics. It sounds straightforward: we have data, we want to know where that data comes from. But the apparent simplicity conceals profound challenges, especially for high-dimensional, structured data like images, audio, or text.
In this page, we develop a rigorous understanding of density estimation: what it means mathematically, how classical approaches work, where they fail, and how modern neural methods attempt to overcome these limitations.
By the end of this page, you will understand:
- the formal definition and mathematical foundations of density estimation,
- the difference between parametric and non-parametric approaches,
- maximum likelihood estimation and its properties,
- the curse of dimensionality and why naive approaches fail in high dimensions, and
- how modern neural density estimators address these challenges.
The Density Estimation Problem
Let $X$ be a random variable taking values in some space $\mathcal{X}$ (e.g., $\mathbb{R}^d$ for continuous data). We assume data points $\mathcal{D} = \{x_1, \ldots, x_n\}$ are drawn i.i.d. from an unknown true distribution with density $p^*(x)$:
$$x_i \stackrel{\text{i.i.d.}}{\sim} p^*(x)$$
Our goal is to construct an estimator $\hat{p}(x)$ such that $\hat{p}$ is 'close' to $p^*$ in some meaningful sense.
What does 'close' mean?
Different notions of closeness lead to different estimation procedures:
KL Divergence (Maximum Likelihood): $$D_{KL}(p^* \| \hat{p}) = \mathbb{E}_{x \sim p^*}\left[\log \frac{p^*(x)}{\hat{p}(x)}\right]$$ Minimizing KL divergence is equivalent to maximizing log-likelihood.
Total Variation Distance: $$TV(p^*, \hat{p}) = \frac{1}{2} \int |p^*(x) - \hat{p}(x)| \, dx$$ Determines the best achievable accuracy of a classifier distinguishing samples from the two distributions.
Wasserstein Distance: $$W_1(p^*, \hat{p}) = \inf_{\gamma \in \Pi(p^*, \hat{p})} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$ Captures geometric similarity; used in Wasserstein GANs.
Each metric has different properties. KL divergence is asymmetric and can be infinite if densities have different supports. Wasserstein is more robust but computationally harder.
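To make these quantities concrete, here is a small self-contained sketch (my own toy comparison, with arbitrarily chosen Gaussians, not part of any library) that computes KL divergence both in closed form and by Monte Carlo, and total variation by numerical integration:

```python
import numpy as np
from scipy.stats import norm

# Two univariate Gaussians: p* = N(0, 1), p_hat = N(0.5, 1.5^2)
mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5
p_star, p_hat = norm(mu1, s1), norm(mu2, s2)

# KL(p* || p_hat): closed form for Gaussians vs. a Monte Carlo estimate
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
x = p_star.rvs(size=100_000, random_state=42)
kl_mc = np.mean(p_star.logpdf(x) - p_hat.logpdf(x))

# Total variation: (1/2) * integral |p*(x) - p_hat(x)| dx, by Riemann sum
grid = np.linspace(-10, 10, 10_001)
dx = grid[1] - grid[0]
tv = 0.5 * np.sum(np.abs(p_star.pdf(grid) - p_hat.pdf(grid))) * dx

print(f"KL closed form:  {kl_closed:.4f}")
print(f"KL Monte Carlo:  {kl_mc:.4f}")
print(f"Total variation: {tv:.4f}")
```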
For continuous data, we estimate probability density functions (PDFs): here $p(x)$ is not the probability of $x$ itself but a density, probability per unit volume, so probabilities of events are obtained by integrating $p$ over regions. For discrete data, we estimate probability mass functions (PMFs), where $p(x)$ is the actual probability of $x$. Most generative model theory applies to both, with integration replaced by summation in the discrete case.
Why is density estimation hard?
The challenge scales dramatically with dimensionality. Consider:
1D: Estimate the distribution of human heights. We might need ~100 samples to get a reasonable histogram.
10D: Estimate the joint distribution of 10 medical measurements. The 'space' we need to cover grows exponentially—with 10 bins per dimension, we have $10^{10}$ cells to estimate.
1,000,000D: Estimate the distribution of 1000×1000 RGB images. Each image is a point in million-dimensional space. We will never have enough data to 'fill' this space.
This is the curse of dimensionality—as dimensions increase, the volume of space grows exponentially, and data becomes sparse. Every point is far from every other point. Naive density estimation becomes impossible.
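A quick numerical illustration of this sparsity (a toy experiment of my own, with uniform points in the unit cube and arbitrary sample sizes): as the dimension grows, the nearest and farthest neighbors of a point become nearly indistinguishable in distance.

```python
import numpy as np

# How pairwise distances concentrate as dimension grows:
# in high dimensions, "nearest" and "farthest" neighbors look almost alike.
rng = np.random.default_rng(0)
n = 500

for d in [1, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))                 # n points in the unit cube
    dists = np.linalg.norm(X[1:] - X[0], axis=1) # distances from one point to the rest
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}: mean dist={dists.mean():7.3f}, relative spread={spread:.3f}")
```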
Modern generative models succeed by exploiting structure: natural data lies on low-dimensional manifolds, exhibits local regularities, and follows hierarchical patterns. Deep learning finds and leverages this structure.
Parametric density estimation assumes the data comes from a known family of distributions with unknown parameters. We estimate the parameters from data.
The Parametric Model:
$$p(x) = p_\theta(x)$$
where $\theta \in \Theta$ are parameters to be learned (e.g., mean $\mu$ and variance $\sigma^2$ for a Gaussian).
Advantages:
- Data-efficient: only a handful of parameters to estimate, so small samples can suffice
- Fast, often closed-form estimation with well-understood statistical guarantees
- Interpretable parameters and cheap density evaluation and sampling

Disadvantages:
- Model misspecification: if the true distribution lies outside the chosen family, more data cannot fix the bias
- Limited expressiveness: simple families struggle with multimodal, skewed, or highly structured data
Common Parametric Models:
| Model | Density | Parameters | Use Case |
|---|---|---|---|
| Univariate Gaussian | $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $\mu, \sigma^2$ | Continuous, unimodal data |
| Multivariate Gaussian | $\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$ | $\mu \in \mathbb{R}^d, \Sigma \in \mathbb{R}^{d \times d}$ | Correlated continuous data |
| Exponential | $\lambda e^{-\lambda x}$ | $\lambda > 0$ | Waiting times, lifespans |
| Categorical | $\prod_k p_k^{[x=k]}$ | $p_1, \ldots, p_K$ (sum to 1) | Discrete categories |
| Poisson | $\frac{\lambda^k e^{-\lambda}}{k!}$ | $\lambda > 0$ | Count data |
Maximum Likelihood Estimation (MLE)
The workhorse of parametric estimation is MLE. Given data $\mathcal{D} = \{x_1, \ldots, x_n\}$, we choose parameters that maximize the probability of observing this data:
$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n p_\theta(x_i) = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i)$$
The log transformation converts products to sums, improving numerical stability and optimization.
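As a minimal illustration of this recipe, the sketch below fits an exponential distribution by numerically minimizing the negative log-likelihood and checks the result against the closed-form MLE $\hat{\lambda} = 1/\bar{x}$ (the data and the true rate are arbitrary choices for the demo):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical MLE: maximize the sum of log-densities (minimize the negative log-likelihood)
rng = np.random.default_rng(0)
data = rng.exponential(scale=1 / 2.5, size=2000)   # samples with true rate λ = 2.5

def neg_log_likelihood(lam):
    # log p(x; λ) = log λ − λ x for the exponential distribution
    return -(len(data) * np.log(lam) - lam * data.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(f"Closed-form MLE λ = {1 / data.mean():.4f}")
print(f"Numerical MLE   λ = {result.x:.4f}")
```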
MLE for Gaussian:
For data from $\mathcal{N}(\mu, \sigma^2)$:
$$\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}$$
$$\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$
The MLE for the mean is the sample mean—intuitive and optimal. The variance estimator is biased (divide by $n-1$ for unbiased version), but consistent.
Properties of MLE:
- Consistency: as $n \to \infty$, $\hat{\theta}_{MLE}$ converges to the true parameter (when the model is well-specified)
- Asymptotic normality: the estimator's distribution approaches a Gaussian centered at the true parameter
- Asymptotic efficiency: it attains the Cramér–Rao lower bound in the limit
- Invariance: the MLE of a function of $\theta$ is that function of the MLE
These properties make MLE the gold standard for parametric estimation.
```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Generate data from known distribution
np.random.seed(42)

# True parameters
true_mu = 5.0
true_sigma = 2.0
n_samples = 1000

# Generate samples
X = np.random.normal(true_mu, true_sigma, size=n_samples)

# MLE estimation
mu_mle = X.mean()
sigma_mle = X.std()  # Note: numpy uses n divisor by default
sigma_unbiased = X.std(ddof=1)  # n-1 divisor for unbiased estimate

print("Univariate Gaussian MLE:")
print(f" True parameters: μ = {true_mu}, σ = {true_sigma}")
print(f" MLE estimates: μ = {mu_mle:.3f}, σ = {sigma_mle:.3f}")
print(f" Unbiased σ: {sigma_unbiased:.3f}")

# Multivariate Gaussian MLE
d = 3
true_mean = np.array([1.0, 2.0, 3.0])
true_cov = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 2.0, 0.4],
    [0.3, 0.4, 1.5]
])

X_multi = np.random.multivariate_normal(true_mean, true_cov, size=n_samples)

# MLE for multivariate Gaussian
mean_mle = X_multi.mean(axis=0)
cov_mle = np.cov(X_multi.T, bias=True)  # bias=True for MLE (n divisor)

print(f"\nMultivariate Gaussian MLE (d={d}):")
print(f" True mean: {true_mean}")
print(f" MLE mean: {mean_mle.round(3)}")
print(f"\n Frobenius distance between true and estimated cov:")
print(f" {np.linalg.norm(true_cov - cov_mle):.4f}")

# Evaluate log-likelihood on held-out data
X_test = np.random.multivariate_normal(true_mean, true_cov, size=100)

# True model log-likelihood
ll_true = multivariate_normal(true_mean, true_cov).logpdf(X_test).mean()
# MLE model log-likelihood
ll_mle = multivariate_normal(mean_mle, cov_mle).logpdf(X_test).mean()

print(f"\n Mean log-likelihood on test data:")
print(f" True model: {ll_true:.3f}")
print(f" MLE model: {ll_mle:.3f}")
```

Non-parametric density estimation makes no fixed assumptions about the functional form of the distribution. The model complexity grows with data.
Philosophy:
Instead of assuming $p^*(x)$ belongs to a parametric family, we let the data 'speak for itself.' The density estimate is directly derived from the observed samples.
Histogram Estimator
The simplest non-parametric approach. Divide the space into bins and estimate density as the proportion of samples in each bin:
$$\hat{p}(x) = \frac{\text{count in bin containing } x}{n \cdot \text{bin width}}$$
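A minimal sketch of the histogram estimator using NumPy (the bin count and data are arbitrary choices): the density estimate is just counts divided by $n$ times the bin width, so it integrates to one by construction.

```python
import numpy as np

# Histogram density estimate: counts normalized by n * bin_width
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=2000)

counts, edges = np.histogram(X, bins=30)
bin_width = edges[1] - edges[0]
density = counts / (len(X) * bin_width)  # same as np.histogram(X, bins=30, density=True)

print(f"Bin width: {bin_width:.3f}")
print(f"Integral of estimate: {np.sum(density * bin_width):.3f}")  # ≈ 1 by construction
```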
Properties:
- Automatically normalized: the estimate integrates to 1
- Discontinuous at bin edges
- Sensitive to both bin width and bin origin
- The number of bins (and hence the data required) grows exponentially with dimension
Kernel Density Estimation (KDE)
KDE smooths the histogram by placing a 'kernel' (smooth bump) at each data point:
$$\hat{p}(x) = \frac{1}{n h} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)$$
where:
- $K$ is the kernel, a smooth non-negative function integrating to 1 (commonly a standard Gaussian)
- $h > 0$ is the bandwidth, controlling how widely each point's bump is spread
- $n$ is the number of samples
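The formula translates almost directly into code. Below is a minimal from-scratch 1D Gaussian-kernel KDE (my own sketch, separate from the `scipy.stats.gaussian_kde` implementation used later in this page):

```python
import numpy as np

def gaussian_kde_1d(x_eval, samples, h):
    """Evaluate the KDE formula directly with a Gaussian kernel K."""
    # Shape (n_eval, n_samples): kernel evaluated at (x - x_i) / h
    u = (x_eval[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(size=300)
x_eval = np.linspace(-4, 4, 9)
print(gaussian_kde_1d(x_eval, samples, h=0.4).round(3))
```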
Bandwidth Selection:
Common approaches:
- Rules of thumb (Scott's rule, Silverman's rule): plug-in formulas based on sample size and spread
- Cross-validation: choose $h$ to maximize held-out log-likelihood
- The bias-variance trade-off is direct: small $h$ gives a spiky, high-variance estimate; large $h$ oversmooths
KDE suffers severely from the curse of dimensionality. For KDE in $d$ dimensions, the optimal bandwidth scales as $h \propto n^{-1/(d+4)}$, and the mean squared error scales as $O(n^{-4/(d+4)})$. In 10 dimensions, you need millions of samples for reasonable estimates. In image dimensions (thousands+), KDE is completely impractical.
k-Nearest Neighbors Density Estimation
An alternative non-parametric approach: instead of fixing bandwidth, fix the number of neighbors:
$$\hat{p}(x) = \frac{k}{n \cdot V_d(r_k(x))}$$
where $r_k(x)$ is the distance to the $k$-th nearest neighbor, and $V_d(r)$ is the volume of a $d$-dimensional ball of radius $r$.
Intuition: In dense regions, the $k$-th neighbor is close, so the ball has small volume, giving high density. In sparse regions, the ball is large, giving low density.
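A minimal sketch of this estimator (assuming a Euclidean metric; the dataset and the choice $k=20$ are arbitrary), using a k-d tree for the neighbor search and the standard formula for the volume of a $d$-dimensional ball:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(x_eval, samples, k):
    """k-NN density estimate: p(x) ≈ k / (n * volume of ball reaching the k-th neighbor)."""
    n, d = samples.shape
    tree = cKDTree(samples)
    # Distance from each evaluation point to its k-th nearest sample
    r_k = tree.query(x_eval, k=k)[0][:, -1]
    # Volume of a d-dimensional ball of radius r is unit_ball * r^d
    unit_ball = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return k / (n * unit_ball * r_k ** d)

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))                # standard 2D Gaussian
x_eval = np.array([[0.0, 0.0], [3.0, 3.0]])         # dense region vs. sparse region
print(knn_density(x_eval, samples, k=20).round(4))  # high density near 0, low far out
```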
Properties:
- Adaptive: the effective bandwidth shrinks in dense regions and grows in sparse ones
- The estimate is not guaranteed to integrate to 1
- Sensitive to the choice of $k$
- Still degrades badly in high dimensions
Comparison of Approaches:
| Method | Smoothness | Key Hyperparameter | Curse Severity | Normalization |
|---|---|---|---|---|
| Histogram | Discontinuous | Bin width | Severe | Automatic |
| KDE | Smooth (kernel-dependent) | Bandwidth $h$ | Severe | Automatic |
| k-NN | Smooth | Number of neighbors $k$ | Severe | Not guaranteed |
```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Generate data from bimodal distribution
np.random.seed(42)
n = 500

# Mixture of two Gaussians
mixture_prob = 0.4
component1 = np.random.normal(-2, 0.8, int(n * mixture_prob))
component2 = np.random.normal(2, 1.2, n - int(n * mixture_prob))
X = np.concatenate([component1, component2])

# Kernel density estimation with different bandwidths
x_grid = np.linspace(-6, 6, 200)

# Automatic bandwidth (Scott's rule)
kde_auto = gaussian_kde(X)
density_auto = kde_auto(x_grid)

# Smaller bandwidth (more detail, more variance)
kde_small = gaussian_kde(X, bw_method=0.1)
density_small = kde_small(x_grid)

# Larger bandwidth (smoother, more bias)
kde_large = gaussian_kde(X, bw_method=0.5)
density_large = kde_large(x_grid)

print("Kernel Density Estimation Comparison:")
print(f" Sample size: {n}")
print(f" Auto bandwidth (Scott): {kde_auto.factor:.4f}")
print(f" Small bandwidth: 0.1 (relative)")
print(f" Large bandwidth: 0.5 (relative)")

# True density for comparison
true_density = (mixture_prob * norm.pdf(x_grid, -2, 0.8) +
                (1 - mixture_prob) * norm.pdf(x_grid, 2, 1.2))

# Compute integrated squared error
ise_auto = np.trapz((density_auto - true_density)**2, x_grid)
ise_small = np.trapz((density_small - true_density)**2, x_grid)
ise_large = np.trapz((density_large - true_density)**2, x_grid)

print(f"\nIntegrated Squared Error (lower is better):")
print(f" Auto bandwidth: {ise_auto:.6f}")
print(f" Small bandwidth: {ise_small:.6f}")
print(f" Large bandwidth: {ise_large:.6f}")

# Demonstrate curse of dimensionality
print("\n--- Curse of Dimensionality ---")
for d in [1, 2, 5, 10, 20]:
    # Required samples for fixed error, roughly scales as exp(d)
    n_required = int(1000 * (5 ** (d - 1)))
    print(f" d={d:2d}: ~{n_required:,} samples needed for good KDE")
```

Mixture models combine parametric flexibility with non-parametric-like expressiveness. They model the distribution as a weighted sum of simpler component distributions:
$$p(x) = \sum_{k=1}^K \pi_k \cdot p_k(x|\theta_k)$$
where:
- $\pi_k \geq 0$ are the mixing weights, with $\sum_{k=1}^K \pi_k = 1$
- $p_k(x|\theta_k)$ are the component densities with parameters $\theta_k$
- $K$ is the number of components
Gaussian Mixture Models (GMMs):
The most common mixture model uses Gaussian components:
$$p(x) = \sum_{k=1}^K \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$
Why mixtures are powerful:
- Universal approximation: with enough components, a Gaussian mixture can approximate any continuous density arbitrarily well
- Latent variable interpretation: sampling can be viewed as first drawing a discrete latent component $k \sim \pi$, then drawing $x \sim p_k(x|\theta_k)$
This latent variable view foreshadows more complex generative models.
Expectation-Maximization (EM) Algorithm
Directly maximizing likelihood for mixtures is hard—we don't know which component generated each point. EM solves this by alternating:
E-step (Expectation): Compute soft assignments—probability that each point came from each component:
$$\gamma_{ik} = \frac{\pi_k \cdot p_k(x_i|\theta_k)}{\sum_j \pi_j \cdot p_j(x_i|\theta_j)}$$
M-step (Maximization): Update parameters using weighted statistics:
$$\pi_k^{new} = \frac{1}{n} \sum_i \gamma_{ik}$$
$$\mu_k^{new} = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$$
$$\Sigma_k^{new} = \frac{\sum_i \gamma_{ik} (x_i - \mu_k^{new})(x_i - \mu_k^{new})^T}{\sum_i \gamma_{ik}}$$
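These updates are short enough to implement directly. Below is a minimal from-scratch EM loop for a two-component 1D Gaussian mixture (a didactic sketch with arbitrary synthetic data and a fixed iteration count, not a production implementation):

```python
import numpy as np

# Minimal EM for a 1D two-component Gaussian mixture, implementing the E- and M-steps above.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(1.5, 1.0, 700)])

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initialize parameters
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities γ_ik, shape (n, K)
    dens = np.stack([pi[k] * gauss_pdf(X, mu[k], var[k]) for k in range(2)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted updates
    Nk = gamma.sum(axis=0)
    pi = Nk / len(X)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

print("weights:", pi.round(3), " means:", mu.round(3), " vars:", var.round(3))
```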
Properties of EM:
- Each iteration is guaranteed not to decrease the log-likelihood
- Converges to a local (not necessarily global) optimum, so initialization matters
- For GMMs, both steps have closed forms and no learning rate is needed
GMMs are the conceptual ancestor of variational autoencoders (VAEs). VAEs also have latent variables (like mixture components) and use a form of EM for training. The key difference: VAEs use neural networks to parameterize flexible, continuous latent spaces rather than discrete mixture components. Understanding GMM foundations illuminates VAE design.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Generate data from known mixture
np.random.seed(42)
n = 600

# True mixture: 3 components in 2D
true_weights = [0.3, 0.5, 0.2]
true_means = [np.array([-3, 0]), np.array([2, 2]), np.array([0, -3])]
true_covs = [
    np.array([[1, 0.5], [0.5, 1]]),
    np.array([[0.5, 0], [0, 2]]),
    np.array([[1.5, -0.3], [-0.3, 0.5]])
]

# Generate samples
X = []
true_labels = []
for i in range(n):
    k = np.random.choice(3, p=true_weights)
    true_labels.append(k)
    X.append(np.random.multivariate_normal(true_means[k], true_covs[k]))
X = np.array(X)
true_labels = np.array(true_labels)

# Fit GMM with known number of components
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)

print("Gaussian Mixture Model Results:")
print(f" Number of samples: {n}")
print(f" Number of components: 3")
print(f"\n Converged: {gmm.converged_}")
print(f" Iterations: {gmm.n_iter_}")
print(f" Log-likelihood per sample: {gmm.score(X):.3f}")

print(f"\n True mixing weights: {true_weights}")
print(f" Learned weights: {gmm.weights_.round(3).tolist()}")

# Evaluate density on a grid
x_range = np.linspace(-6, 5, 100)
y_range = np.linspace(-6, 5, 100)
X_grid, Y_grid = np.meshgrid(x_range, y_range)
positions = np.column_stack([X_grid.ravel(), Y_grid.ravel()])

# Compute learned density
log_density = gmm.score_samples(positions)
density = np.exp(log_density).reshape(X_grid.shape)

# Sample from the learned model
X_synthetic, _ = gmm.sample(200)
print(f"\n Generated {len(X_synthetic)} synthetic samples")

# Model selection with BIC
print("\n--- Model Selection (BIC) ---")
bics = []
for k in range(1, 7):
    gmm_k = GaussianMixture(n_components=k, random_state=42)
    gmm_k.fit(X)
    bics.append(gmm_k.bic(X))
    print(f" K={k}: BIC = {bics[-1]:.1f}")
print(f" Best K (lowest BIC): {np.argmin(bics) + 1}")
```

Classical density estimation methods—histograms, KDE, GMMs—work well in low dimensions but fail catastrophically for high-dimensional data like images. Neural density estimation uses deep learning to learn expressive density models that can scale to complex data.
The Core Challenge:
To define a valid probability density, we need:
- Non-negativity: $p_\theta(x) \geq 0$ for all $x$
- Normalization: $\int p_\theta(x) \, dx = 1$
For neural networks outputting arbitrary values, ensuring these constraints is non-trivial.
Approach 1: Autoregressive Models
Factor the joint density using the chain rule of probability:
$$p(x) = p(x_1) \cdot p(x_2|x_1) \cdot p(x_3|x_1, x_2) \cdots = \prod_{i=1}^d p(x_i | x_{<i})$$
Each conditional $p(x_i | x_{<i})$ is modeled by a neural network. If each conditional is a valid density (e.g., categorical for discrete, Gaussian for continuous), the product is automatically valid.
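A toy sketch of this factorization (not a neural model: the conditionals are hand-fixed Gaussians whose mean depends linearly on the previous coordinate, with arbitrary constants $a$ and $\sigma$). The bookkeeping is exactly what neural autoregressive models do: sum conditional log-densities to evaluate, sample one coordinate at a time.

```python
import numpy as np
from scipy.stats import norm

# Toy autoregressive density over R^d: p(x_i | x_{<i}) = N(a * x_{i-1}, sigma^2).
# The joint log-density is the sum of conditional log-densities (chain rule).
def ar_log_prob(x, a=0.8, sigma=1.0):
    log_p = norm(0.0, sigma).logpdf(x[0])                # p(x_1)
    for i in range(1, len(x)):
        log_p += norm(a * x[i - 1], sigma).logpdf(x[i])  # p(x_i | x_{i-1})
    return log_p

# Sampling is sequential: draw x_1, then x_2 given x_1, and so on.
rng = np.random.default_rng(0)
x = np.zeros(5)
x[0] = rng.normal(0.0, 1.0)
for i in range(1, 5):
    x[i] = rng.normal(0.8 * x[i - 1], 1.0)

print("sample:", x.round(3))
print("log p(x):", ar_log_prob(x).round(3))
```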
Examples: PixelRNN/PixelCNN for images, WaveNet for audio, and GPT-style Transformers for text.
Advantages: exact likelihood computation, powerful expressiveness.
Disadvantages: sequential generation (slow), an arbitrary variable ordering must be chosen.
Autoregressive models require an ordering of variables. For sequences (text, audio), natural order exists. For images, the raster-scan order (left-to-right, top-to-bottom) is conventional but arbitrary—the upper-left pixel has no causal priority over the lower-right. Despite this arbitrariness, autoregressive image models work remarkably well.
Approach 2: Normalizing Flows
Transform a simple base distribution (e.g., standard Gaussian) through invertible transformations to create a complex distribution:
$$z \sim \mathcal{N}(0, I) \quad \rightarrow \quad x = f_\theta(z)$$
If $f$ is invertible and differentiable, the density of $x$ follows from the change of variables formula:
$$p_X(x) = p_Z(f^{-1}(x)) \cdot \left|\det \frac{\partial f^{-1}}{\partial x}\right|$$
The Jacobian determinant accounts for how $f$ stretches or compresses space.
Key insight: By stacking many simple invertible transformations, we can build arbitrarily complex distributions while maintaining tractable density evaluation.
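The change-of-variables formula can be checked on the simplest possible flow: a single affine map $f(z) = az + b$ (constants chosen arbitrarily), where the exact answer is known analytically. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

# Change of variables with one invertible affine map f(z) = a*z + b.
# The flow density p_X(x) = p_Z(f^{-1}(x)) * |d f^{-1}/dx| should match the analytic answer.
a, b = 2.0, -1.0

def flow_log_prob(x):
    z = (x - b) / a              # f^{-1}(x)
    log_det = -np.log(abs(a))    # log |d f^{-1} / dx|
    return norm(0, 1).logpdf(z) + log_det

# x = a*z + b with z ~ N(0, 1) is exactly N(b, a^2)
x_test = np.array([-2.0, 0.0, 3.0])
print("flow formula:", flow_log_prob(x_test).round(4))
print("analytic    :", norm(b, abs(a)).logpdf(x_test).round(4))
```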
Examples: RealNVP, Glow, Neural Spline Flows
Advantages: exact likelihood, efficient sampling, a meaningful latent space.
Disadvantages: architectural constraints (transformations must be invertible), Jacobian computation can be costly.
Approach 3: Energy-Based Models
Define an unnormalized density (energy function) and train without computing the normalizing constant:
$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta} \quad \text{where} \quad Z_\theta = \int \exp(-E_\theta(x)) dx$$
The energy $E_\theta(x)$ can be any neural network—no constraints! Low energy = high probability.
Problem: $Z_\theta$ is intractable (integral over all possible $x$).
Solutions:
- Contrastive divergence: approximate the likelihood gradient with short MCMC chains
- Score matching: fit the score $\nabla_x \log p_\theta(x)$, which does not depend on $Z_\theta$
- Noise contrastive estimation: recast density estimation as classification against a noise distribution
Deep connection: Score-based diffusion models (state-of-the-art image generation) are intimately related to energy-based models through the score function $\nabla_x \log p(x)$.
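To see both the appeal and the difficulty, here is a 1D toy sketch (the double-well energy is an arbitrary choice of mine): any energy function defines an unnormalized density, and in one dimension we can brute-force $Z_\theta$ by quadrature, which is precisely the step that becomes intractable in high dimensions.

```python
import numpy as np

# A 1D energy-based model: any E(x) defines an unnormalized density exp(-E(x)).
def energy(x):
    return 0.1 * x**4 - x**2   # double-well: two low-energy (high-probability) regions

grid = np.linspace(-5, 5, 20_001)
dx = grid[1] - grid[0]
unnormalized = np.exp(-energy(grid))
Z = unnormalized.sum() * dx    # Riemann-sum approximation of the normalizing constant

density = unnormalized / Z
print(f"Z ≈ {Z:.4f}")
print(f"Density integrates to {np.sum(density) * dx:.4f}")
print(f"Modes near x = ±{abs(grid[np.argmax(density)]):.2f}")
```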
Approach 4: Variational Autoencoders (VAEs)
Learn a latent variable model by optimizing a tractable lower bound on log-likelihood:
$$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$$
VAEs will be covered in depth in Module 2.
| Method | Exact Likelihood? | Sampling Speed | Latent Space? | Architecture Constraints |
|---|---|---|---|---|
| Autoregressive | Yes | Slow (sequential) | No (unless hybrid) | Causal masking |
| Normalizing Flows | Yes | Fast (one pass) | Yes (base distribution) | Must be invertible |
| Energy-Based | No (unnormalized) | Slow (MCMC) | No | None |
| VAE | No (lower bound) | Fast (one pass) | Yes (latent z) | Encoder-decoder |
Maximum likelihood estimation is so fundamental to density estimation that it deserves deeper examination. Understanding its properties, limitations, and connections to information theory is essential for modern generative modeling.
The Likelihood Function:
Given data $\mathcal{D} = \{x_1, \ldots, x_n\}$ and model $p_\theta$, the likelihood is:
$$L(\theta) = \prod_{i=1}^n p_\theta(x_i)$$
The log-likelihood is:
$$\ell(\theta) = \sum_{i=1}^n \log p_\theta(x_i)$$
MLE as KL Minimization:
A beautiful connection: minimizing KL divergence from true distribution to model is equivalent to maximizing likelihood:
$$D_{KL}(p^* \| p_\theta) = \mathbb{E}_{x \sim p^*}[\log p^*(x)] - \mathbb{E}_{x \sim p^*}[\log p_\theta(x)]$$
The first term is the (negative) entropy of the true distribution—constant with respect to $\theta$. Minimizing KL is thus equivalent to maximizing $\mathbb{E}_{x \sim p^*}[\log p_\theta(x)]$, which is estimated by the sample log-likelihood.
Cross-Entropy Loss:
In practice, we minimize the cross-entropy:
$$H(p^*, p_\theta) = -\mathbb{E}_{x \sim p^*}[\log p_\theta(x)] \approx -\frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i)$$
This is exactly the negative log-likelihood (per sample). Whether you call it 'maximizing likelihood' or 'minimizing cross-entropy,' it's the same optimization.
For image models, likelihood is often reported as 'bits per dimension' (bits/dim): the negative log-likelihood in base 2, divided by the number of dimensions (pixels). Lower is better. A bits/dim of 3.0 means the model uses 3 bits on average to encode each pixel. State-of-the-art models achieve ~2.5-3.5 bits/dim on natural images.
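The conversion is a one-liner. The sketch below uses made-up numbers (a hypothetical 32×32 RGB image and a hypothetical total NLL in nats) purely to show the arithmetic:

```python
import numpy as np

# Converting a model's negative log-likelihood (in nats) into bits per dimension.
nll_nats = 6500.0          # hypothetical total NLL for one image, in nats
n_dims = 32 * 32 * 3       # 32x32 RGB image: 3072 dimensions

bits_per_dim = nll_nats / (n_dims * np.log(2))
print(f"bits/dim = {bits_per_dim:.3f}")   # ≈ 3.05 for these numbers
```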
MLE Pathologies:
Despite its elegance, MLE has important limitations:
1. Mode-Covering Behavior
Minimizing $D_{KL}(p^* \| p_\theta)$ (forward KL) is mode-covering: the model is heavily penalized for assigning near-zero probability to regions where $p^*$ has mass, so it must spread probability over all modes, even at the cost of placing mass where $p^*$ has little (for example, between modes, which yields blurry averages when the model family is too limited).
This contrasts with reverse KL, $D_{KL}(p_\theta \| p^*)$, which is mode-seeking: it can drop entire modes while fitting the remaining ones sharply. Reverse KL is rarely used directly in density estimation.
2. Overfitting
With powerful models (many parameters), MLE can memorize training data. In the limit, the unconstrained MLE is the empirical distribution, which places probability $1/n$ on each training point and zero elsewhere, useless for generalization.
3. Infinite Likelihood Trap
For continuous data, likelihood can be made arbitrarily high by placing delta functions at training points. Regularization (priors, early stopping) is essential.
4. Blind to Perceptual Quality
MLE optimizes for log-probability, which doesn't directly correspond to human perception. A model can have high likelihood but generate blurry images, or low likelihood but great samples. This motivated alternative objectives like adversarial training (GANs).
```python
import numpy as np
from scipy.special import logsumexp

# Demonstrate MLE as KL minimization
np.random.seed(42)

# True distribution: mixture of Gaussians
def true_log_prob(x):
    """Log p*(x) for an equal mixture of N(-2, 1) and N(2, 1)."""
    component1 = -0.5 * ((x - (-2)) ** 2) - 0.5 * np.log(2 * np.pi)
    component2 = -0.5 * ((x - 2) ** 2) - 0.5 * np.log(2 * np.pi)
    return logsumexp([component1 + np.log(0.5), component2 + np.log(0.5)])

# Model: single Gaussian (misspecified)
def model_log_prob(x, mu, sigma):
    """Log p_θ(x) for single Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# Generate samples from true distribution
n = 1000
true_samples = np.concatenate([
    np.random.normal(-2, 1, n // 2),
    np.random.normal(2, 1, n // 2)
])

# MLE for misspecified Gaussian model
mu_mle = true_samples.mean()    # Will be ~0 (between modes)
sigma_mle = true_samples.std()  # Will be ~2.2 (inflated to cover both)

print("MLE with Misspecified Model:")
print(f" True distribution: Mixture of N(-2,1) and N(2,1)")
print(f" Model family: Single Gaussian N(μ, σ²)")
print(f"\n MLE estimates: μ = {mu_mle:.3f}, σ = {sigma_mle:.3f}")
print(f" (μ ends up between modes, σ is inflated)")

# Compute likelihoods
sample_lls_true = np.array([true_log_prob(x) for x in true_samples])
sample_lls_model = model_log_prob(true_samples, mu_mle, sigma_mle)

print(f"\n Mean log-likelihood under true distribution: {sample_lls_true.mean():.3f}")
print(f" Mean log-likelihood under MLE model: {sample_lls_model.mean():.3f}")
print(f" Gap (≈ KL divergence estimate): {sample_lls_true.mean() - sample_lls_model.mean():.3f}")

# Bits per dimension interpretation
bits_per_dim = -sample_lls_model.mean() / np.log(2)
print(f"\n Bits per dimension (1D): {bits_per_dim:.3f}")

# Demonstrate mode-covering (mass-spreading) behavior of the misspecified MLE fit
x_test = np.linspace(-6, 6, 200)
true_density = np.exp(np.array([true_log_prob(x) for x in x_test]))
model_density = np.exp(model_log_prob(x_test, mu_mle, sigma_mle))

# The single-Gaussian fit puts substantial probability at x=0 (between the modes)
# while under-weighting the actual modes at -2 and 2
print("\n Mode-covering behavior:")
print(f" True density at x=0: {np.exp(true_log_prob(0)):.4f}")
print(f" Model density at x=0: {np.exp(model_log_prob(0, mu_mle, sigma_mle)):.4f}")
print(f" True density at x=2: {np.exp(true_log_prob(2)):.4f}")
print(f" Model density at x=2: {np.exp(model_log_prob(2, mu_mle, sigma_mle)):.4f}")
```

A radical departure from classical density estimation: implicit density models define distributions through a sampling procedure rather than an explicit density function.
The Key Insight:
Instead of learning $p_\theta(x)$ explicitly, learn a function $G_\theta: z \mapsto x$ that transforms noise into samples:
$$z \sim \mathcal{N}(0, I), \quad x = G_\theta(z)$$
The distribution of $x$ is implicitly defined by $G_\theta$ and the noise distribution. We can sample from it easily (forward pass through $G$), but we cannot evaluate $p_\theta(x)$ for arbitrary $x$.
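A toy sketch of an implicit model (the generator here is a hand-picked nonlinear map, not a trained network): sampling is a single forward pass, but there is no expression for $p_\theta(x)$ we could evaluate at an arbitrary point.

```python
import numpy as np

# Implicit model: push Gaussian noise through a fixed nonlinear "generator" G.
rng = np.random.default_rng(0)

def G(z):
    # Toy generator: maps 2D noise to a noisy ring in 2D
    angle = 2 * np.pi * np.tanh(z[:, 0])
    radius = 3.0 + 0.3 * z[:, 1]
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

z = rng.normal(size=(1000, 2))
x = G(z)  # easy: sampling is one forward pass
print("sample mean:", x.mean(axis=0).round(3))
print("mean sample radius:", np.linalg.norm(x, axis=1).mean().round(3))
# Hard: there is no tractable formula for p(x) at, say, x = (3, 0).
```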
Why abandon explicit density?
Tractability: Computing normalizing constants and Jacobians is hard. Implicit models sidestep this entirely.
Flexibility: Without density constraints, $G_\theta$ can be any neural network.
Sample quality: Empirically, implicit models (especially GANs) produce sharper samples than likelihood-based models.
The Trade-off:
Without explicit densities, important capabilities vanish: comparing likelihood of models via held-out data, computing exact marginals and conditionals, principled uncertainty quantification, and density-based anomaly detection. Implicit models trade these for flexibility and sample quality.
Generative Adversarial Networks (GANs):
The most famous implicit model. A generator $G$ tries to produce realistic samples, while a discriminator $D$ tries to distinguish real from fake:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
At equilibrium (in theory), $G$ produces samples indistinguishable from real data, implying $p_G = p_{data}$.
Moment Matching:
Another implicit approach: match statistics (moments) of generated and real distributions:
$$\min_\theta \left\| \mathbb{E}_{x \sim p_{data}}[\phi(x)] - \mathbb{E}_{z \sim p_z}[\phi(G_\theta(z))] \right\|^2$$
where $\phi$ extracts features. This is the basis of Maximum Mean Discrepancy (MMD) training.
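A toy sketch of moment matching (with $\phi(x) = (x, x^2)$ and a grid search standing in for gradient-based training; all specifics are arbitrary choices of mine, and this is not full kernel MMD):

```python
import numpy as np

# Moment matching: compare feature means of real and generated samples.
rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, scale=1.5, size=5000)

def phi_mean(samples):
    # φ(x) = (x, x^2): first and second moments
    return np.array([samples.mean(), (samples**2).mean()])

def moment_loss(mu, sigma):
    fake = rng.normal(loc=mu, scale=sigma, size=5000)
    diff = phi_mean(real) - phi_mean(fake)
    return np.dot(diff, diff)

# Grid search over the "generator" parameters (a stand-in for gradient descent)
mus, sigmas = np.linspace(0, 4, 41), np.linspace(0.5, 3, 26)
losses = [(moment_loss(m, s), m, s) for m in mus for s in sigmas]
best_loss, best_mu, best_sigma = min(losses)
print(f"best μ ≈ {best_mu:.2f}, best σ ≈ {best_sigma:.2f} (true: 2.0, 1.5)")
```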
Hybrid Approaches:
Some models combine implicit and explicit elements: for example, VAE-GANs add an adversarial loss to a VAE's likelihood-based objective, and adversarial autoencoders replace the VAE's KL term with a discriminator.
| Property | Explicit Models | Implicit Models |
|---|---|---|
| Density evaluation | Yes ($p_\theta(x)$ computable) | No (only sampling) |
| Training objective | Maximum likelihood | Adversarial / Moment matching |
| Architectural freedom | Often constrained | Fully flexible |
| Sample quality | Sometimes blurry | Often sharper |
| Mode coverage | Better (likelihood penalizes missing modes) | Worse (mode collapse common) |
| Anomaly detection | Natural via low probability | Difficult without density |
Density estimation is the mathematical backbone of generative modeling. We've traversed from classical approaches to modern neural methods, illuminating how different techniques trade off expressiveness, tractability, and computational cost.
What's next:
Density estimation tells us how to evaluate the probability of data. But generative models must also sample—create new data instances from the learned distribution. In the next page, we'll explore sampling techniques: how to convert learned distributions into tangible new examples, from simple ancestral sampling to sophisticated MCMC methods.
You now have a rigorous understanding of density estimation—the mathematical core of generative modeling. Every generative model we study, from VAEs to diffusion models, is a specific answer to: 'How do we estimate P(x) for complex, high-dimensional data?' The techniques differ, but the goal remains density estimation.