At the heart of every generative model lies a seemingly simple goal: estimate the probability density function of data. Given samples $x_1, x_2, \ldots, x_n$ drawn from some unknown distribution $P^*(x)$, we want to learn an approximation $P_\theta(x)$ that captures the true data-generating process.
This is density estimation—arguably the most fundamental problem in statistics. It sounds straightforward: we have data, we want to know where that data comes from. But the apparent simplicity conceals profound challenges, especially for high-dimensional, structured data like images, audio, or text.
In this page, we develop a rigorous understanding of density estimation: what it means mathematically, how classical approaches work, where they fail, and how modern neural methods attempt to overcome these limitations.
By the end of this page, you will understand:
- the formal definition and mathematical foundations of density estimation,
- the difference between parametric and non-parametric approaches,
- maximum likelihood estimation and its properties,
- the curse of dimensionality and why naive approaches fail in high dimensions, and
- how modern neural density estimators address these challenges.
The Density Estimation Problem
Let $X$ be a random variable taking values in some space $\mathcal{X}$ (e.g., $\mathbb{R}^d$ for continuous data). We assume data points $\mathcal{D} = \{x_1, \ldots, x_n\}$ are drawn i.i.d. from an unknown true distribution with density $p^*(x)$:
$$x_i \stackrel{\text{i.i.d.}}{\sim} p^*(x)$$
Our goal is to construct an estimator $\hat{p}(x)$ such that $\hat{p}$ is 'close' to $p^*$ in some meaningful sense.
What does 'close' mean?
Different notions of closeness lead to different estimation procedures:
KL Divergence (Maximum Likelihood): $$D_{KL}(p^* \| \hat{p}) = \mathbb{E}_{x \sim p^*}\left[\log \frac{p^*(x)}{\hat{p}(x)}\right]$$ Minimizing KL divergence is equivalent to maximizing log-likelihood.
Total Variation Distance: $$TV(p^*, \hat{p}) = \frac{1}{2} \int |p^*(x) - \hat{p}(x)| \, dx$$ Determines the best achievable accuracy of a classifier distinguishing samples from the two distributions.
Wasserstein Distance: $$W_1(p^*, \hat{p}) = \inf_{\gamma \in \Pi(p^*, \hat{p})} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$ Captures geometric similarity; used in Wasserstein GANs.
Each metric has different properties. KL divergence is asymmetric and can be infinite if densities have different supports. Wasserstein is more robust but computationally harder.
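To make these quantities concrete, here is a small self-contained sketch (my own toy comparison, with arbitrarily chosen Gaussians, not part of any library) that computes KL divergence both in closed form and by Monte Carlo, and total variation by numerical integration:

```python
import numpy as np
from scipy.stats import norm

# Two univariate Gaussians: p* = N(0, 1), p_hat = N(0.5, 1.5^2)
mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5
p_star, p_hat = norm(mu1, s1), norm(mu2, s2)

# KL(p* || p_hat): closed form for Gaussians vs. a Monte Carlo estimate
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
x = p_star.rvs(size=100_000, random_state=42)
kl_mc = np.mean(p_star.logpdf(x) - p_hat.logpdf(x))

# Total variation: (1/2) * integral |p*(x) - p_hat(x)| dx, by Riemann sum
grid = np.linspace(-10, 10, 10_001)
dx = grid[1] - grid[0]
tv = 0.5 * np.sum(np.abs(p_star.pdf(grid) - p_hat.pdf(grid))) * dx

print(f"KL closed form:  {kl_closed:.4f}")
print(f"KL Monte Carlo:  {kl_mc:.4f}")
print(f"Total variation: {tv:.4f}")
```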
For continuous data, we estimate probability density functions (PDFs): here $p(x)$ is not the probability of $x$ itself but a density, probability per unit volume, so probabilities of events are obtained by integrating $p$ over regions. For discrete data, we estimate probability mass functions (PMFs), where $p(x)$ is the actual probability of $x$. Most generative model theory applies to both, with integration replaced by summation in the discrete case.
Why is density estimation hard?
The challenge scales dramatically with dimensionality. Consider:
1D: Estimate the distribution of human heights. We might need ~100 samples to get a reasonable histogram.
10D: Estimate the joint distribution of 10 medical measurements. The 'space' we need to cover grows exponentially—with 10 bins per dimension, we have $10^{10}$ cells to estimate.
1,000,000D: Estimate the distribution of 1000×1000 RGB images. Each image is a point in million-dimensional space. We will never have enough data to 'fill' this space.
This is the curse of dimensionality—as dimensions increase, the volume of space grows exponentially, and data becomes sparse. Every point is far from every other point. Naive density estimation becomes impossible.
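A quick numerical illustration of this sparsity (a toy experiment of my own, with uniform points in the unit cube and arbitrary sample sizes): as the dimension grows, the nearest and farthest neighbors of a point become nearly indistinguishable in distance.

```python
import numpy as np

# How pairwise distances concentrate as dimension grows:
# in high dimensions, "nearest" and "farthest" neighbors look almost alike.
rng = np.random.default_rng(0)
n = 500

for d in [1, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))                 # n points in the unit cube
    dists = np.linalg.norm(X[1:] - X[0], axis=1) # distances from one point to the rest
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}: mean dist={dists.mean():7.3f}, relative spread={spread:.3f}")
```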
Modern generative models succeed by exploiting structure: natural data lies on low-dimensional manifolds, exhibits local regularities, and follows hierarchical patterns. Deep learning finds and leverages this structure.
Parametric density estimation assumes the data comes from a known family of distributions with unknown parameters. We estimate the parameters from data.
The Parametric Model:
$$p(x) = p_\theta(x)$$
where $\theta \in \Theta$ are parameters to be learned (e.g., mean $\mu$ and variance $\sigma^2$ for a Gaussian).
Advantages:
- Data-efficient: only a handful of parameters to estimate, so small samples can suffice
- Fast, often closed-form estimation with well-understood statistical guarantees
- Interpretable parameters and cheap density evaluation and sampling

Disadvantages:
- Model misspecification: if the true distribution lies outside the chosen family, more data cannot fix the bias
- Limited expressiveness: simple families struggle with multimodal, skewed, or highly structured data
Common Parametric Models:
| Model | Density | Parameters | Use Case |
|---|---|---|---|
| Univariate Gaussian | $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $\mu, \sigma^2$ | Continuous, unimodal data |
| Multivariate Gaussian | $\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$ | $\mu \in \mathbb{R}^d, \Sigma \in \mathbb{R}^{d \times d}$ | Correlated continuous data |
| Exponential | $\lambda e^{-\lambda x}$ | $\lambda > 0$ | Waiting times, lifespans |
| Categorical | $\prod_k p_k^{[x=k]}$ | $p_1, \ldots, p_K$ (sum to 1) | Discrete categories |
| Poisson | $\frac{\lambda^k e^{-\lambda}}{k!}$ | $\lambda > 0$ | Count data |
Maximum Likelihood Estimation (MLE)
The workhorse of parametric estimation is MLE. Given data $\mathcal{D} = \{x_1, \ldots, x_n\}$, we choose parameters that maximize the probability of observing this data:
$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n p_\theta(x_i) = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i)$$
The log transformation converts products to sums, improving numerical stability and optimization.
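As a minimal illustration of this recipe, the sketch below fits an exponential distribution by numerically minimizing the negative log-likelihood and checks the result against the closed-form MLE $\hat{\lambda} = 1/\bar{x}$ (the data and the true rate are arbitrary choices for the demo):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical MLE: maximize the sum of log-densities (minimize the negative log-likelihood)
rng = np.random.default_rng(0)
data = rng.exponential(scale=1 / 2.5, size=2000)   # samples with true rate λ = 2.5

def neg_log_likelihood(lam):
    # log p(x; λ) = log λ − λ x for the exponential distribution
    return -(len(data) * np.log(lam) - lam * data.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(f"Closed-form MLE λ = {1 / data.mean():.4f}")
print(f"Numerical MLE   λ = {result.x:.4f}")
```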
MLE for Gaussian:
For data from $\mathcal{N}(\mu, \sigma^2)$:
$$\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}$$
$$\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$
The MLE for the mean is the sample mean—intuitive and optimal. The variance estimator is biased (divide by $n-1$ for unbiased version), but consistent.
Properties of MLE:
- Consistency: as $n \to \infty$, $\hat{\theta}_{MLE}$ converges to the true parameter (when the model is well-specified)
- Asymptotic normality: the estimator's distribution approaches a Gaussian centered at the true parameter
- Asymptotic efficiency: it attains the Cramér–Rao lower bound in the limit
- Invariance: the MLE of a function of $\theta$ is that function of the MLE
These properties make MLE the gold standard for parametric estimation.
```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Generate data from known distribution
np.random.seed(42)

# True parameters
true_mu = 5.0
true_sigma = 2.0
n_samples = 1000

# Generate samples
X = np.random.normal(true_mu, true_sigma, size=n_samples)

# MLE estimation
mu_mle = X.mean()
sigma_mle = X.std()  # Note: numpy uses n divisor by default
sigma_unbiased = X.std(ddof=1)  # n-1 divisor for unbiased estimate

print("Univariate Gaussian MLE:")
print(f" True parameters: μ = {true_mu}, σ = {true_sigma}")
print(f" MLE estimates: μ = {mu_mle:.3f}, σ = {sigma_mle:.3f}")
print(f" Unbiased σ: {sigma_unbiased:.3f}")

# Multivariate Gaussian MLE
d = 3
true_mean = np.array([1.0, 2.0, 3.0])
true_cov = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 2.0, 0.4],
    [0.3, 0.4, 1.5]
])

X_multi = np.random.multivariate_normal(true_mean, true_cov, size=n_samples)

# MLE for multivariate Gaussian
mean_mle = X_multi.mean(axis=0)
cov_mle = np.cov(X_multi.T, bias=True)  # bias=True for MLE (n divisor)

print(f"\nMultivariate Gaussian MLE (d={d}):")
print(f" True mean: {true_mean}")
print(f" MLE mean: {mean_mle.round(3)}")
print(f"\n Frobenius distance between true and estimated cov:")
print(f" {np.linalg.norm(true_cov - cov_mle):.4f}")

# Evaluate log-likelihood on held-out data
X_test = np.random.multivariate_normal(true_mean, true_cov, size=100)

# True model log-likelihood
ll_true = multivariate_normal(true_mean, true_cov).logpdf(X_test).mean()
# MLE model log-likelihood
ll_mle = multivariate_normal(mean_mle, cov_mle).logpdf(X_test).mean()

print(f"\n Mean log-likelihood on test data:")
print(f" True model: {ll_true:.3f}")
print(f" MLE model: {ll_mle:.3f}")
```

Non-parametric density estimation makes no fixed assumptions about the functional form of the distribution. The model complexity grows with data.
Philosophy:
Instead of assuming $p^*(x)$ belongs to a parametric family, we let the data 'speak for itself.' The density estimate is directly derived from the observed samples.
Histogram Estimator
The simplest non-parametric approach. Divide the space into bins and estimate density as the proportion of samples in each bin:
$$\hat{p}(x) = \frac{\text{count in bin containing } x}{n \cdot \text{bin width}}$$
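A minimal sketch of the histogram estimator using NumPy (the bin count and data are arbitrary choices): the density estimate is just counts divided by $n$ times the bin width, so it integrates to one by construction.

```python
import numpy as np

# Histogram density estimate: counts normalized by n * bin_width
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=2000)

counts, edges = np.histogram(X, bins=30)
bin_width = edges[1] - edges[0]
density = counts / (len(X) * bin_width)  # same as np.histogram(X, bins=30, density=True)

print(f"Bin width: {bin_width:.3f}")
print(f"Integral of estimate: {np.sum(density * bin_width):.3f}")  # ≈ 1 by construction
```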
Properties:
- Automatically normalized: the estimate integrates to 1
- Discontinuous at bin edges
- Sensitive to both bin width and bin origin
- The number of bins (and hence the data required) grows exponentially with dimension
Kernel Density Estimation (KDE)
KDE smooths the histogram by placing a 'kernel' (smooth bump) at each data point:
$$\hat{p}(x) = \frac{1}{n h} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)$$
where:
- $K$ is the kernel, a smooth non-negative function integrating to 1 (commonly a standard Gaussian)
- $h > 0$ is the bandwidth, controlling how widely each point's bump is spread
- $n$ is the number of samples
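The formula translates almost directly into code. Below is a minimal from-scratch 1D Gaussian-kernel KDE (my own sketch, separate from the `scipy.stats.gaussian_kde` implementation used later in this page):

```python
import numpy as np

def gaussian_kde_1d(x_eval, samples, h):
    """Evaluate the KDE formula directly with a Gaussian kernel K."""
    # Shape (n_eval, n_samples): kernel evaluated at (x - x_i) / h
    u = (x_eval[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(size=300)
x_eval = np.linspace(-4, 4, 9)
print(gaussian_kde_1d(x_eval, samples, h=0.4).round(3))
```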
Bandwidth Selection:
Common approaches:
- Rules of thumb (Scott's rule, Silverman's rule): plug-in formulas based on sample size and spread
- Cross-validation: choose $h$ to maximize held-out log-likelihood
- The bias-variance trade-off is direct: small $h$ gives a spiky, high-variance estimate; large $h$ oversmooths
KDE suffers severely from the curse of dimensionality. For KDE in $d$ dimensions, the optimal bandwidth scales as $h \propto n^{-1/(d+4)}$, and the mean squared error scales as $O(n^{-4/(d+4)})$. In 10 dimensions, you need millions of samples for reasonable estimates. In image dimensions (thousands+), KDE is completely impractical.
k-Nearest Neighbors Density Estimation
An alternative non-parametric approach: instead of fixing bandwidth, fix the number of neighbors:
$$\hat{p}(x) = \frac{k}{n \cdot V_d(r_k(x))}$$
where $r_k(x)$ is the distance to the $k$-th nearest neighbor, and $V_d(r)$ is the volume of a $d$-dimensional ball of radius $r$.
Intuition: In dense regions, the $k$-th neighbor is close, so the ball has small volume, giving high density. In sparse regions, the ball is large, giving low density.
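A minimal sketch of this estimator (assuming a Euclidean metric; the dataset and the choice $k=20$ are arbitrary), using a k-d tree for the neighbor search and the standard formula for the volume of a $d$-dimensional ball:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(x_eval, samples, k):
    """k-NN density estimate: p(x) ≈ k / (n * volume of ball reaching the k-th neighbor)."""
    n, d = samples.shape
    tree = cKDTree(samples)
    # Distance from each evaluation point to its k-th nearest sample
    r_k = tree.query(x_eval, k=k)[0][:, -1]
    # Volume of a d-dimensional ball of radius r is unit_ball * r^d
    unit_ball = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return k / (n * unit_ball * r_k ** d)

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))                # standard 2D Gaussian
x_eval = np.array([[0.0, 0.0], [3.0, 3.0]])         # dense region vs. sparse region
print(knn_density(x_eval, samples, k=20).round(4))  # high density near 0, low far out
```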
Properties:
- Adaptive: the effective bandwidth shrinks in dense regions and grows in sparse ones
- The estimate is not guaranteed to integrate to 1
- Sensitive to the choice of $k$
- Still degrades badly in high dimensions
Comparison of Approaches:
| Method | Smoothness | Key Hyperparameter | Curse Severity | Normalization |
|---|---|---|---|---|
| Histogram | Discontinuous | Bin width | Severe | Automatic |
| KDE | Smooth (kernel-dependent) | Bandwidth $h$ | Severe | Automatic |
| k-NN | Smooth | Number of neighbors $k$ | Severe | Not guaranteed |
```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Generate data from bimodal distribution
np.random.seed(42)
n = 500

# Mixture of two Gaussians
mixture_prob = 0.4
component1 = np.random.normal(-2, 0.8, int(n * mixture_prob))
component2 = np.random.normal(2, 1.2, n - int(n * mixture_prob))
X = np.concatenate([component1, component2])

# Kernel density estimation with different bandwidths
x_grid = np.linspace(-6, 6, 200)

# Automatic bandwidth (Scott's rule)
kde_auto = gaussian_kde(X)
density_auto = kde_auto(x_grid)

# Smaller bandwidth (more detail, more variance)
kde_small = gaussian_kde(X, bw_method=0.1)
density_small = kde_small(x_grid)

# Larger bandwidth (smoother, more bias)
kde_large = gaussian_kde(X, bw_method=0.5)
density_large = kde_large(x_grid)

print("Kernel Density Estimation Comparison:")
print(f" Sample size: {n}")
print(f" Auto bandwidth (Scott): {kde_auto.factor:.4f}")
print(f" Small bandwidth: 0.1 (relative)")
print(f" Large bandwidth: 0.5 (relative)")

# True density for comparison
true_density = (mixture_prob * norm.pdf(x_grid, -2, 0.8) +
                (1 - mixture_prob) * norm.pdf(x_grid, 2, 1.2))

# Compute integrated squared error
ise_auto = np.trapz((density_auto - true_density)**2, x_grid)
ise_small = np.trapz((density_small - true_density)**2, x_grid)
ise_large = np.trapz((density_large - true_density)**2, x_grid)

print(f"\nIntegrated Squared Error (lower is better):")
print(f" Auto bandwidth: {ise_auto:.6f}")
print(f" Small bandwidth: {ise_small:.6f}")
print(f" Large bandwidth: {ise_large:.6f}")

# Demonstrate curse of dimensionality
print("\n--- Curse of Dimensionality ---")
for d in [1, 2, 5, 10, 20]:
    # Required samples for fixed error, roughly scales as exp(d)
    n_required = int(1000 * (5 ** (d - 1)))
    print(f" d={d:2d}: ~{n_required:,} samples needed for good KDE")
```

Mixture models combine parametric flexibility with non-parametric-like expressiveness. They model the distribution as a weighted sum of simpler component distributions:
$$p(x) = \sum_{k=1}^K \pi_k \cdot p_k(x|\theta_k)$$
where:
- $\pi_k \geq 0$ are the mixing weights, with $\sum_{k=1}^K \pi_k = 1$
- $p_k(x|\theta_k)$ are the component densities with parameters $\theta_k$
- $K$ is the number of components
Gaussian Mixture Models (GMMs):
The most common mixture model uses Gaussian components:
$$p(x) = \sum_{k=1}^K \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$
Why mixtures are powerful:
- Universal approximation: with enough components, a Gaussian mixture can approximate any continuous density arbitrarily well
- Latent variable interpretation: sampling can be viewed as first drawing a discrete latent component $k \sim \pi$, then drawing $x \sim p_k(x|\theta_k)$
This latent variable view foreshadows more complex generative models.
Expectation-Maximization (EM) Algorithm
Directly maximizing likelihood for mixtures is hard—we don't know which component generated each point. EM solves this by alternating:
E-step (Expectation): Compute soft assignments—probability that each point came from each component:
$$\gamma_{ik} = \frac{\pi_k \cdot p_k(x_i|\theta_k)}{\sum_j \pi_j \cdot p_j(x_i|\theta_j)}$$
M-step (Maximization): Update parameters using weighted statistics:
$$\pi_k^{new} = \frac{1}{n} \sum_i \gamma_{ik}$$
$$\mu_k^{new} = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$$
$$\Sigma_k^{new} = \frac{\sum_i \gamma_{ik} (x_i - \mu_k^{new})(x_i - \mu_k^{new})^T}{\sum_i \gamma_{ik}}$$
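These updates are short enough to implement directly. Below is a minimal from-scratch EM loop for a two-component 1D Gaussian mixture (a didactic sketch with arbitrary synthetic data and a fixed iteration count, not a production implementation):

```python
import numpy as np

# Minimal EM for a 1D two-component Gaussian mixture, implementing the E- and M-steps above.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(1.5, 1.0, 700)])

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initialize parameters
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities γ_ik, shape (n, K)
    dens = np.stack([pi[k] * gauss_pdf(X, mu[k], var[k]) for k in range(2)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted updates
    Nk = gamma.sum(axis=0)
    pi = Nk / len(X)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

print("weights:", pi.round(3), " means:", mu.round(3), " vars:", var.round(3))
```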
Properties of EM:
- Each iteration is guaranteed not to decrease the log-likelihood
- Converges to a local (not necessarily global) optimum, so initialization matters
- For GMMs, both steps have closed forms and no learning rate is needed
GMMs are the conceptual ancestor of variational autoencoders (VAEs). VAEs also have latent variables (like mixture components) and use a form of EM for training. The key difference: VAEs use neural networks to parameterize flexible, continuous latent spaces rather than discrete mixture components. Understanding GMM foundations illuminates VAE design.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Generate data from known mixture
np.random.seed(42)
n = 600

# True mixture: 3 components in 2D
true_weights = [0.3, 0.5, 0.2]
true_means = [np.array([-3, 0]), np.array([2, 2]), np.array([0, -3])]
true_covs = [
    np.array([[1, 0.5], [0.5, 1]]),
    np.array([[0.5, 0], [0, 2]]),
    np.array([[1.5, -0.3], [-0.3, 0.5]])
]

# Generate samples
X = []
true_labels = []
for i in range(n):
    k = np.random.choice(3, p=true_weights)
    true_labels.append(k)
    X.append(np.random.multivariate_normal(true_means[k], true_covs[k]))
X = np.array(X)
true_labels = np.array(true_labels)

# Fit GMM with known number of components
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)

print("Gaussian Mixture Model Results:")
print(f" Number of samples: {n}")
print(f" Number of components: 3")
print(f"\n Converged: {gmm.converged_}")
print(f" Iterations: {gmm.n_iter_}")
print(f" Log-likelihood per sample: {gmm.score(X):.3f}")

print(f"\n True mixing weights: {true_weights}")
print(f" Learned weights: {gmm.weights_.round(3).tolist()}")

# Evaluate density on a grid
x_range = np.linspace(-6, 5, 100)
y_range = np.linspace(-6, 5, 100)
X_grid, Y_grid = np.meshgrid(x_range, y_range)
positions = np.column_stack([X_grid.ravel(), Y_grid.ravel()])

# Compute learned density
log_density = gmm.score_samples(positions)
density = np.exp(log_density).reshape(X_grid.shape)

# Sample from the learned model
X_synthetic, _ = gmm.sample(200)
print(f"\n Generated {len(X_synthetic)} synthetic samples")

# Model selection with BIC
print("\n--- Model Selection (BIC) ---")
bics = []
for k in range(1, 7):
    gmm_k = GaussianMixture(n_components=k, random_state=42)
    gmm_k.fit(X)
    bics.append(gmm_k.bic(X))
    print(f" K={k}: BIC = {bics[-1]:.1f}")
print(f" Best K (lowest BIC): {np.argmin(bics) + 1}")
```

Classical density estimation methods—histograms, KDE, GMMs—work well in low dimensions but fail catastrophically for high-dimensional data like images. Neural density estimation uses deep learning to learn expressive density models that can scale to complex data.
The Core Challenge:
To define a valid probability density, we need:
- Non-negativity: $p_\theta(x) \geq 0$ for all $x$
- Normalization: $\int p_\theta(x) \, dx = 1$
For neural networks outputting arbitrary values, ensuring these constraints is non-trivial.
Approach 1: Autoregressive Models
Factor the joint density using the chain rule of probability:
$$p(x) = p(x_1) \cdot p(x_2|x_1) \cdot p(x_3|x_1, x_2) \cdots = \prod_{i=1}^d p(x_i | x_{<i})$$
Each conditional $p(x_i | x_{<i})$ is modeled by a neural network. If each conditional is a valid density (e.g., categorical for discrete, Gaussian for continuous), the product is automatically valid.
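A toy sketch of this factorization (not a neural model: the conditionals are hand-fixed Gaussians whose mean depends linearly on the previous coordinate, with arbitrary constants $a$ and $\sigma$). The bookkeeping is exactly what neural autoregressive models do: sum conditional log-densities to evaluate, sample one coordinate at a time.

```python
import numpy as np
from scipy.stats import norm

# Toy autoregressive density over R^d: p(x_i | x_{<i}) = N(a * x_{i-1}, sigma^2).
# The joint log-density is the sum of conditional log-densities (chain rule).
def ar_log_prob(x, a=0.8, sigma=1.0):
    log_p = norm(0.0, sigma).logpdf(x[0])                # p(x_1)
    for i in range(1, len(x)):
        log_p += norm(a * x[i - 1], sigma).logpdf(x[i])  # p(x_i | x_{i-1})
    return log_p

# Sampling is sequential: draw x_1, then x_2 given x_1, and so on.
rng = np.random.default_rng(0)
x = np.zeros(5)
x[0] = rng.normal(0.0, 1.0)
for i in range(1, 5):
    x[i] = rng.normal(0.8 * x[i - 1], 1.0)

print("sample:", x.round(3))
print("log p(x):", ar_log_prob(x).round(3))
```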
Examples: PixelRNN/PixelCNN for images, WaveNet for audio, and GPT-style Transformers for text.
Advantages: exact likelihood computation, powerful expressiveness.
Disadvantages: sequential generation (slow), an arbitrary variable ordering must be chosen.
Autoregressive models require an ordering of variables. For sequences (text, audio), natural order exists. For images, the raster-scan order (left-to-right, top-to-bottom) is conventional but arbitrary—the upper-left pixel has no causal priority over the lower-right. Despite this arbitrariness, autoregressive image models work remarkably well.
Approach 2: Normalizing Flows
Transform a simple base distribution (e.g., standard Gaussian) through invertible transformations to create a complex distribution:
$$z \sim \mathcal{N}(0, I) \quad \rightarrow \quad x = f_\theta(z)$$
If $f$ is invertible and differentiable, the density of $x$ follows from the change of variables formula:
$$p_X(x) = p_Z(f^{-1}(x)) \cdot \left|\det \frac{\partial f^{-1}}{\partial x}\right|$$
The Jacobian determinant accounts for how $f$ stretches or compresses space.
Key insight: By stacking many simple invertible transformations, we can build arbitrarily complex distributions while maintaining tractable density evaluation.
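The change-of-variables formula can be checked on the simplest possible flow: a single affine map $f(z) = az + b$ (constants chosen arbitrarily), where the exact answer is known analytically. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

# Change of variables with one invertible affine map f(z) = a*z + b.
# The flow density p_X(x) = p_Z(f^{-1}(x)) * |d f^{-1}/dx| should match the analytic answer.
a, b = 2.0, -1.0

def flow_log_prob(x):
    z = (x - b) / a              # f^{-1}(x)
    log_det = -np.log(abs(a))    # log |d f^{-1} / dx|
    return norm(0, 1).logpdf(z) + log_det

# x = a*z + b with z ~ N(0, 1) is exactly N(b, a^2)
x_test = np.array([-2.0, 0.0, 3.0])
print("flow formula:", flow_log_prob(x_test).round(4))
print("analytic    :", norm(b, abs(a)).logpdf(x_test).round(4))
```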
Examples: RealNVP, Glow, Neural Spline Flows
Advantages: exact likelihood, efficient sampling, a meaningful latent space.
Disadvantages: architectural constraints (transformations must be invertible), Jacobian computation can be costly.
Approach 3: Energy-Based Models
Define an unnormalized density (energy function) and train without computing the normalizing constant:
$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta} \quad \text{where} \quad Z_\theta = \int \exp(-E_\theta(x)) dx$$
The energy $E_\theta(x)$ can be any neural network—no constraints! Low energy = high probability.
Problem: $Z_\theta$ is intractable (integral over all possible $x$).
Solutions:
- Contrastive divergence: approximate the likelihood gradient with short MCMC chains
- Score matching: fit the score $\nabla_x \log p_\theta(x)$, which does not depend on $Z_\theta$
- Noise contrastive estimation: recast density estimation as classification against a noise distribution
Deep connection: Score-based diffusion models (state-of-the-art image generation) are intimately related to energy-based models through the score function $\nabla_x \log p(x)$.
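To see both the appeal and the difficulty, here is a 1D toy sketch (the double-well energy is an arbitrary choice of mine): any energy function defines an unnormalized density, and in one dimension we can brute-force $Z_\theta$ by quadrature, which is precisely the step that becomes intractable in high dimensions.

```python
import numpy as np

# A 1D energy-based model: any E(x) defines an unnormalized density exp(-E(x)).
def energy(x):
    return 0.1 * x**4 - x**2   # double-well: two low-energy (high-probability) regions

grid = np.linspace(-5, 5, 20_001)
dx = grid[1] - grid[0]
unnormalized = np.exp(-energy(grid))
Z = unnormalized.sum() * dx    # Riemann-sum approximation of the normalizing constant

density = unnormalized / Z
print(f"Z ≈ {Z:.4f}")
print(f"Density integrates to {np.sum(density) * dx:.4f}")
print(f"Modes near x = ±{abs(grid[np.argmax(density)]):.2f}")
```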
Approach 4: Variational Autoencoders (VAEs)
Learn a latent variable model by optimizing a tractable lower bound on log-likelihood:
$$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$$
VAEs will be covered in depth in Module 2.
| Method | Exact Likelihood? | Sampling Speed | Latent Space? | Architecture Constraints |
|---|---|---|---|---|
| Autoregressive | Yes | Slow (sequential) | No (unless hybrid) | Causal masking |
| Normalizing Flows | Yes | Fast (one pass) | Yes (base distribution) | Must be invertible |
| Energy-Based | No (unnormalized) | Slow (MCMC) | No | None |
| VAE | No (lower bound) | Fast (one pass) | Yes (latent z) | Encoder-decoder |
Maximum likelihood estimation is so fundamental to density estimation that it deserves deeper examination. Understanding its properties, limitations, and connections to information theory is essential for modern generative modeling.
The Likelihood Function:
Given data $\mathcal{D} = \{x_1, \ldots, x_n\}$ and model $p_\theta$, the likelihood is:
$$L(\theta) = \prod_{i=1}^n p_\theta(x_i)$$
The log-likelihood is:
$$\ell(\theta) = \sum_{i=1}^n \log p_\theta(x_i)$$
MLE as KL Minimization:
A beautiful connection: minimizing KL divergence from true distribution to model is equivalent to maximizing likelihood:
$$D_{KL}(p^* \| p_\theta) = \mathbb{E}_{x \sim p^*}[\log p^*(x)] - \mathbb{E}_{x \sim p^*}[\log p_\theta(x)]$$
The first term is the (negative) entropy of the true distribution—constant with respect to $\theta$. Minimizing KL is thus equivalent to maximizing $\mathbb{E}_{x \sim p^*}[\log p_\theta(x)]$, which is estimated by the sample log-likelihood.
Cross-Entropy Loss:
In practice, we minimize the cross-entropy:
$$H(p^*, p_\theta) = -\mathbb{E}_{x \sim p^*}[\log p_\theta(x)] \approx -\frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i)$$
This is exactly the negative log-likelihood (per sample). Whether you call it 'maximizing likelihood' or 'minimizing cross-entropy,' it's the same optimization.
For image models, likelihood is often reported as 'bits per dimension' (bits/dim): the negative log-likelihood in base 2, divided by the number of dimensions (pixels). Lower is better. A bits/dim of 3.0 means the model uses 3 bits on average to encode each pixel. State-of-the-art models achieve ~2.5-3.5 bits/dim on natural images.
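The conversion is a one-liner. The sketch below uses made-up numbers (a hypothetical 32×32 RGB image and a hypothetical total NLL in nats) purely to show the arithmetic:

```python
import numpy as np

# Converting a model's negative log-likelihood (in nats) into bits per dimension.
nll_nats = 6500.0          # hypothetical total NLL for one image, in nats
n_dims = 32 * 32 * 3       # 32x32 RGB image: 3072 dimensions

bits_per_dim = nll_nats / (n_dims * np.log(2))
print(f"bits/dim = {bits_per_dim:.3f}")   # ≈ 3.05 for these numbers
```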
MLE Pathologies:
Despite its elegance, MLE has important limitations:
1. Mode-Covering Behavior
Minimizing $D_{KL}(p^* \| p_\theta)$ (forward KL) is mode-covering: the model is heavily penalized for assigning near-zero probability to regions where $p^*$ has mass, so it must spread probability over all modes, even at the cost of placing mass where $p^*$ has little (for example, between modes, which yields blurry averages when the model family is too limited).
This contrasts with reverse KL, $D_{KL}(p_\theta \| p^*)$, which is mode-seeking: it can drop entire modes while fitting the remaining ones sharply. Reverse KL is rarely used directly in density estimation.
2. Overfitting
With powerful models (many parameters), MLE can memorize training data. In the limit, the unconstrained MLE is the empirical distribution, which places probability $1/n$ on each training point and zero elsewhere, useless for generalization.
3. Infinite Likelihood Trap
For continuous data, likelihood can be made arbitrarily high by placing delta functions at training points. Regularization (priors, early stopping) is essential.
4. Blind to Perceptual Quality
MLE optimizes for log-probability, which doesn't directly correspond to human perception. A model can have high likelihood but generate blurry images, or low likelihood but great samples. This motivated alternative objectives like adversarial training (GANs).
```python
import numpy as np
from scipy.special import logsumexp

# Demonstrate MLE as KL minimization
np.random.seed(42)

# True distribution: mixture of Gaussians
def true_log_prob(x):
    """Log p*(x) for an equal mixture of N(-2, 1) and N(2, 1)."""
    component1 = -0.5 * ((x - (-2)) ** 2) - 0.5 * np.log(2 * np.pi)
    component2 = -0.5 * ((x - 2) ** 2) - 0.5 * np.log(2 * np.pi)
    return logsumexp([component1 + np.log(0.5), component2 + np.log(0.5)])

# Model: single Gaussian (misspecified)
def model_log_prob(x, mu, sigma):
    """Log p_θ(x) for single Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# Generate samples from true distribution
n = 1000
true_samples = np.concatenate([
    np.random.normal(-2, 1, n // 2),
    np.random.normal(2, 1, n // 2)
])

# MLE for misspecified Gaussian model
mu_mle = true_samples.mean()    # Will be ~0 (between modes)
sigma_mle = true_samples.std()  # Will be ~2.2 (inflated to cover both)

print("MLE with Misspecified Model:")
print(f" True distribution: Mixture of N(-2,1) and N(2,1)")
print(f" Model family: Single Gaussian N(μ, σ²)")
print(f"\n MLE estimates: μ = {mu_mle:.3f}, σ = {sigma_mle:.3f}")
print(f" (μ ends up between modes, σ is inflated)")

# Compute likelihoods
sample_lls_true = np.array([true_log_prob(x) for x in true_samples])
sample_lls_model = model_log_prob(true_samples, mu_mle, sigma_mle)

print(f"\n Mean log-likelihood under true distribution: {sample_lls_true.mean():.3f}")
print(f" Mean log-likelihood under MLE model: {sample_lls_model.mean():.3f}")
print(f" Gap (≈ KL divergence estimate): {sample_lls_true.mean() - sample_lls_model.mean():.3f}")

# Bits per dimension interpretation
bits_per_dim = -sample_lls_model.mean() / np.log(2)
print(f"\n Bits per dimension (1D): {bits_per_dim:.3f}")

# Demonstrate mode-covering (mass-spreading) behavior of the misspecified MLE fit
x_test = np.linspace(-6, 6, 200)
true_density = np.exp(np.array([true_log_prob(x) for x in x_test]))
model_density = np.exp(model_log_prob(x_test, mu_mle, sigma_mle))

# The single-Gaussian fit puts substantial probability at x=0 (between the modes)
# while under-weighting the actual modes at -2 and 2
print("\n Mode-covering behavior:")
print(f" True density at x=0: {np.exp(true_log_prob(0)):.4f}")
print(f" Model density at x=0: {np.exp(model_log_prob(0, mu_mle, sigma_mle)):.4f}")
print(f" True density at x=2: {np.exp(true_log_prob(2)):.4f}")
print(f" Model density at x=2: {np.exp(model_log_prob(2, mu_mle, sigma_mle)):.4f}")
```

A radical departure from classical density estimation: implicit density models define distributions through a sampling procedure rather than an explicit density function.
The Key Insight:
Instead of learning $p_\theta(x)$ explicitly, learn a function $G_\theta: z \mapsto x$ that transforms noise into samples:
$$z \sim \mathcal{N}(0, I), \quad x = G_\theta(z)$$
The distribution of $x$ is implicitly defined by $G_\theta$ and the noise distribution. We can sample from it easily (forward pass through $G$), but we cannot evaluate $p_\theta(x)$ for arbitrary $x$.
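A toy sketch of an implicit model (the generator here is a hand-picked nonlinear map, not a trained network): sampling is a single forward pass, but there is no expression for $p_\theta(x)$ we could evaluate at an arbitrary point.

```python
import numpy as np

# Implicit model: push Gaussian noise through a fixed nonlinear "generator" G.
rng = np.random.default_rng(0)

def G(z):
    # Toy generator: maps 2D noise to a noisy ring in 2D
    angle = 2 * np.pi * np.tanh(z[:, 0])
    radius = 3.0 + 0.3 * z[:, 1]
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

z = rng.normal(size=(1000, 2))
x = G(z)  # easy: sampling is one forward pass
print("sample mean:", x.mean(axis=0).round(3))
print("mean sample radius:", np.linalg.norm(x, axis=1).mean().round(3))
# Hard: there is no tractable formula for p(x) at, say, x = (3, 0).
```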
Why abandon explicit density?
Tractability: Computing normalizing constants and Jacobians is hard. Implicit models sidestep this entirely.
Flexibility: Without density constraints, $G_\theta$ can be any neural network.
Sample quality: Empirically, implicit models (especially GANs) produce sharper samples than likelihood-based models.
The Trade-off:
Without explicit densities, important capabilities vanish: comparing likelihood of models via held-out data, computing exact marginals and conditionals, principled uncertainty quantification, and density-based anomaly detection. Implicit models trade these for flexibility and sample quality.
Generative Adversarial Networks (GANs):
The most famous implicit model. A generator $G$ tries to produce realistic samples, while a discriminator $D$ tries to distinguish real from fake:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
At equilibrium (in theory), $G$ produces samples indistinguishable from real data, implying $p_G = p_{data}$.
Moment Matching:
Another implicit approach: match statistics (moments) of generated and real distributions:
$$\min_\theta \left\| \mathbb{E}_{x \sim p_{data}}[\phi(x)] - \mathbb{E}_{z \sim p_z}[\phi(G_\theta(z))] \right\|^2$$
where $\phi$ extracts features. This is the basis of Maximum Mean Discrepancy (MMD) training.
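A toy sketch of moment matching (with $\phi(x) = (x, x^2)$ and a grid search standing in for gradient-based training; all specifics are arbitrary choices of mine, and this is not full kernel MMD):

```python
import numpy as np

# Moment matching: compare feature means of real and generated samples.
rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, scale=1.5, size=5000)

def phi_mean(samples):
    # φ(x) = (x, x^2): first and second moments
    return np.array([samples.mean(), (samples**2).mean()])

def moment_loss(mu, sigma):
    fake = rng.normal(loc=mu, scale=sigma, size=5000)
    diff = phi_mean(real) - phi_mean(fake)
    return np.dot(diff, diff)

# Grid search over the "generator" parameters (a stand-in for gradient descent)
mus, sigmas = np.linspace(0, 4, 41), np.linspace(0.5, 3, 26)
losses = [(moment_loss(m, s), m, s) for m in mus for s in sigmas]
best_loss, best_mu, best_sigma = min(losses)
print(f"best μ ≈ {best_mu:.2f}, best σ ≈ {best_sigma:.2f} (true: 2.0, 1.5)")
```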
Hybrid Approaches:
Some models combine implicit and explicit elements: for example, VAE-GANs add an adversarial loss to a VAE's likelihood-based objective, and adversarial autoencoders replace the VAE's KL term with a discriminator.
| Property | Explicit Models | Implicit Models |
|---|---|---|
| Density evaluation | Yes ($p_\theta(x)$ computable) | No (only sampling) |
| Training objective | Maximum likelihood | Adversarial / Moment matching |
| Architectural freedom | Often constrained | Fully flexible |
| Sample quality | Sometimes blurry | Often sharper |
| Mode coverage | Better (likelihood penalizes missing modes) | Worse (mode collapse common) |
| Anomaly detection | Natural via low probability | Difficult without density |
Density estimation is the mathematical backbone of generative modeling. We've traversed from classical approaches to modern neural methods, illuminating how different techniques trade off expressiveness, tractability, and computational cost.
What's next:
Density estimation tells us how to evaluate the probability of data. But generative models must also sample—create new data instances from the learned distribution. In the next page, we'll explore sampling techniques: how to convert learned distributions into tangible new examples, from simple ancestral sampling to sophisticated MCMC methods.
You now have a rigorous understanding of density estimation—the mathematical core of generative modeling. Every generative model we study, from VAEs to diffusion models, is a specific answer to: 'How do we estimate P(x) for complex, high-dimensional data?' The techniques differ, but the goal remains density estimation.