Generative modeling represents one of the most ambitious goals in machine learning: learning to generate new data that resembles training data. Rather than predicting labels or values for given inputs, generative models learn the underlying distribution of the data itself.
Consider what this enables: a model that captures the data distribution can synthesize entirely new, realistic examples rather than merely labeling existing ones.
Generative models answer the question "What does data from this domain look like?", capturing the full complexity of high-dimensional distributions.
By the end of this page, you will understand generative modeling at a foundational level: the formal distinction between discriminative and generative models, density estimation and latent variable approaches, major paradigms (VAEs, GANs, autoregressive models, diffusion models), evaluation challenges, and the connection to other ML problem types. This knowledge enables you to understand and apply the most powerful generative technologies.
Understanding generative models requires contrasting them with the discriminative models we've seen for classification and regression.
Discriminative Models:
Model the conditional distribution $p(y \mid x)$: given an input $x$, predict the target $y$ directly, without modeling how the inputs themselves are distributed.
Generative Models:
Model the joint distribution $p(x, y)$ or simply $p(x)$: capture how the data itself is distributed, which makes it possible to evaluate and sample new examples.
| Aspect | Discriminative | Generative |
|---|---|---|
| What's modeled | $p(y \mid x)$ — conditional | $p(x, y)$ or $p(x)$ — joint/marginal |
| Goal | Predict labels given inputs | Model data distribution |
| Can generate new data? | No | Yes |
| Role of $x$ | Conditioning input only | Modeled explicitly |
| Typical accuracy | Often higher for classification | May be lower (harder problem) |
| Data efficiency | Needs labeled data | Can use unlabeled data |
| Examples | Logistic regression, SVM, neural classifiers | Naive Bayes, VAE, GAN, GPT |
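To make the contrast concrete, here is a minimal sketch using scikit-learn on illustrative synthetic two-class data (the dataset and model choices are assumptions for the example, not part of the discussion above): logistic regression models $p(y \mid x)$ and can only predict labels, while Gaussian Naive Bayes fits $p(x \mid y)$ and $p(y)$, so it can also be sampled to produce new feature vectors.

```python
# Discriminative vs. generative on the same synthetic two-class dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),      # class 0 cluster
               rng.normal(3.0, 1.0, (100, 2))])     # class 1 cluster
y = np.array([0] * 100 + [1] * 100)

disc = LogisticRegression().fit(X, y)   # learns p(y|x) directly
gen = GaussianNB().fit(X, y)            # learns p(x|y) and p(y)

print(disc.predict_proba(X[:2]))        # conditional class probabilities only

# The generative model can synthesize a new "class 1" point by sampling
# from its fitted class-conditional Gaussian.
x_new = rng.normal(gen.theta_[1], np.sqrt(gen.var_[1]))
print(x_new)
```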
Generative models offer unique capabilities: data augmentation (generate more training examples), semi-supervised learning (leverage unlabeled data), anomaly detection (flag low-probability samples), imputation (fill in missing values), and creative applications (generate novel content). They solve a harder problem but enable more diverse applications.
The Modeling Challenge:
Generative modeling faces a fundamental challenge: real-world data distributions are incredibly complex.
Consider images: even a modest 256×256 RGB image lives in a space of nearly 200,000 dimensions, yet natural images occupy only a vanishingly small, highly structured region of that space (a low-dimensional manifold).
Generative models must learn to concentrate probability on this manifold while assigning low probability elsewhere. This is extraordinarily difficult—and why generative modeling has driven so much innovation.
The most direct approach to generative modeling: explicitly estimate the probability density function $p(x)$.
Why Density Estimation?
With an explicit density, you can evaluate the likelihood of any data point, compare models on held-out data, and flag low-probability inputs as anomalies.
The Challenge:
In high dimensions, density estimation is extremely hard. Histograms don't scale. Kernel density estimation suffers from the curse of dimensionality. We need parametric models with tractable densities.
Autoregressive Models:
Decompose the joint density using the chain rule: $$p(x) = p(x_1) \cdot p(x_2 | x_1) \cdot p(x_3 | x_1, x_2) \cdots = \prod_{i=1}^{d} p(x_i | x_{<i})$$
Model each conditional $p(x_i | x_{<i})$ with a neural network.
Examples: PixelRNN/PixelCNN for images, WaveNet for audio, and GPT-style transformers for text.
Advantages: exact likelihoods; stable maximum-likelihood training; flexible architectures. Disadvantages: generation is sequential and therefore slow; samples cannot easily be produced in parallel.
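As a concrete illustration of the chain-rule factorization, here is a minimal sketch of an illustrative toy model (not one of the named architectures above): a small GRU over binary sequences in which position $i$ only sees $x_{<i}$, so the exact $\log p(x)$ is the sum of per-position conditional log-probabilities.

```python
# Toy autoregressive model: log p(x) = sum_i log p(x_i | x_{<i}).
import torch
import torch.nn as nn

class TinyAR(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def log_prob(self, x):                      # x: (batch, length) in {0, 1}
        # Shift right so position i only conditions on x_{<i} (start token = 0).
        inp = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.rnn(inp.unsqueeze(-1).float())
        logits = self.head(h).squeeze(-1)       # (batch, length)
        dist = torch.distributions.Bernoulli(logits=logits)
        return dist.log_prob(x.float()).sum(dim=1)   # exact log p(x)

model = TinyAR()
x = torch.randint(0, 2, (4, 16))
loss = -model.log_prob(x).mean()                # maximum-likelihood training
loss.backward()
```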
Normalizing Flows:
Transform a simple base distribution (e.g., Gaussian) through a series of invertible transformations: $$x = f_K \circ f_{K-1} \circ \cdots \circ f_1(z), \quad z \sim p_Z(z)$$
The density of $x$ is computed via the change of variables formula, writing the intermediate values as $z_k = f_k(z_{k-1})$ with $z_0 = z$: $$\log p(x) = \log p_Z(z) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$
For this to be tractable, each transformation must be invertible with an efficiently computable inverse, and its Jacobian log-determinant must be cheap to evaluate (e.g., triangular, as in coupling layers).
Examples: RealNVP, Glow, Neural Spline Flows
Advantages: exact likelihoods; efficient, parallel sampling; invertibility. Disadvantages: architectural constraints limit expressiveness; can be memory-intensive.
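Below is a minimal sketch of a single RealNVP-style affine coupling layer (a simplified illustration under assumed dimensions, not a reference implementation): half the dimensions pass through unchanged and parameterize an affine transform of the other half, so the Jacobian is triangular and its log-determinant is just the sum of the predicted log-scales. Note that this code evaluates the flow in the data-to-base direction, so the log-determinant enters $\log p(x)$ with a plus sign.

```python
# One affine coupling layer with an exact, cheap log-determinant.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Small network maps the untouched half to [log_s, t] for the other half.
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, x):                       # data x -> base z, plus log|det J|
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)               # keep scales well-behaved
        z = torch.cat([x1, x2 * log_s.exp() + t], dim=-1)
        return z, log_s.sum(dim=-1)             # triangular Jacobian -> sum of log-scales

    def inverse(self, z):                       # exact inverse, used for sampling
        z1, z2 = z.chunk(2, dim=-1)
        log_s, t = self.net(z1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        return torch.cat([z1, (z2 - t) * (-log_s).exp()], dim=-1)

flow = AffineCoupling(dim=4)
x = torch.randn(8, 4)
z, log_det = flow(x)
base = torch.distributions.Normal(0.0, 1.0)
log_px = base.log_prob(z).sum(dim=-1) + log_det   # change of variables (x -> z direction)
```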
Use autoregressive models when: generation speed is acceptable, exact likelihoods are needed, domain has natural ordering (text, audio). Use normalizing flows when: parallel sampling is needed, invertibility matters, exact likelihoods are needed. Both are excellent for anomaly detection (low likelihood = anomaly).
Many generative models posit latent (hidden) variables that explain observed data. The intuition: complex data arises from simpler, unobserved factors.
The Latent Variable Framework:
$$p(x) = \int p(x \mid z)\, p(z)\, dz$$
where $z$ is an unobserved latent variable, $p(z)$ is a simple prior over latent codes (typically a standard Gaussian), and $p(x \mid z)$ is a likelihood (decoder) that maps latent codes to observations.
Interpretation: to generate a data point, first draw a latent code $z \sim p(z)$ representing high-level factors of variation, then draw $x \sim p(x \mid z)$; the marginal $p(x)$ averages over every possible explanation $z$.
Variational Autoencoders (VAEs):
VAEs learn both an encoder $q_\phi(z|x)$ that approximates the posterior over latent codes and a decoder $p_\theta(x|z)$ that reconstructs data from those codes.
The ELBO (Evidence Lower Bound):
Since $\log p(x) = \text{ELBO} + D_{KL}(q_\phi(z|x) \,\|\, p(z|x))$, and $D_{KL} \geq 0$: $$\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$$
Maximizing the ELBO pushes up a lower bound on $\log p(x)$: the reconstruction term rewards faithful decoding, while the KL term keeps the approximate posterior close to the prior, so that sampling $z \sim p(z)$ at generation time produces sensible outputs.
Training via reparameterization:
To backpropagate through sampling, use the reparameterization trick: $$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
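Putting the ELBO and the reparameterization trick together, here is a minimal single-layer VAE sketch (the architecture and the pixel-wise Bernoulli-style likelihood are illustrative assumptions): the encoder outputs $\mu$ and $\log \sigma^2$, $z$ is sampled via reparameterization, and the loss is the negative ELBO, i.e., reconstruction plus the analytic KL to the standard-normal prior.

```python
# Toy VAE: negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)          # outputs [mu, log_var]
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        z = mu + (0.5 * log_var).exp() * eps            # reparameterization trick
        x_logits = self.dec(z)
        # Pixel-wise BCE reconstruction term of the ELBO.
        recon = F.binary_cross_entropy_with_logits(
            x_logits, x, reduction="none").sum(-1)
        # Analytic KL between N(mu, sigma^2) and the standard-normal prior.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1)
        return (recon + kl).mean()                      # negative ELBO

vae = VAE()
x = torch.rand(32, 784)                                  # stand-in for binarized images
loss = vae(x)
loss.backward()
```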
VAEs with pixel-wise reconstruction loss tend to produce blurry images because the model averages over uncertainty. Solutions include: perceptual losses (compare in feature space), adversarial losses (add GAN component), hierarchical VAEs, or using VAEs for latent space and other decoders for images.
GANs revolutionized generative modeling with a simple but powerful idea: train a generator by making it fool a discriminator.
The GAN Framework:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Intuition: the generator $G$ maps random noise $z$ to candidate samples, while the discriminator $D$ tries to tell real data from generated data. As $D$ gets better at spotting fakes, $G$ must produce increasingly realistic samples to keep fooling it; at the theoretical equilibrium, the generator's distribution matches the data distribution.
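A minimal training-loop sketch on toy 2-D data follows. The discriminator update matches the minimax objective above; the generator update uses the common non-saturating variant (maximize $\log D(G(z))$ rather than minimize $\log(1 - D(G(z)))$), which is an assumed practical substitution, not part of the formula above.

```python
# Alternating GAN updates on a toy 2-D "real" distribution.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) + 3.0                 # stand-in "real" data
    z = torch.randn(64, 8)
    fake = G(z)

    # Discriminator step: push real toward 1, generated toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating): make D label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```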
Why GANs Produce Sharp Images:
Unlike VAEs, GANs don't use pixel-wise reconstruction loss. The discriminator provides a learned loss: it evaluates whether a sample 'looks real' according to learned features. This naturally encourages sharp textures and realistic detail rather than the blurry averages produced by pixel-wise losses.
GAN Training Challenges: the minimax game can oscillate or diverge, generator gradients vanish when the discriminator becomes too strong, training is sensitive to architecture and hyperparameters, and the generator can collapse onto a few modes. A long line of GAN variants targets these issues:
| Variant | Key Innovation | Addresses |
|---|---|---|
| DCGAN | Deep convolutional architecture; batch norm; training guidelines | Image quality, training stability |
| WGAN | Wasserstein distance instead of JS divergence | Stability, mode collapse |
| WGAN-GP | Gradient penalty instead of weight clipping | Training stability |
| Progressive GAN | Grow resolution during training | High-resolution generation |
| StyleGAN | Style-based generator; AdaIN | Quality, controllability |
| StyleGAN2/3 | Fixes artifacts; improved architecture | State-of-the-art image quality |
| BigGAN | Large scale; class conditioning | ImageNet synthesis |
| CycleGAN | Unpaired image-to-image translation | Domain transfer without paired data |
Mode collapse occurs when the generator learns to produce only a few types of samples that fool the discriminator, ignoring the full diversity of the data. Diagnostics: check generated sample diversity; compute metrics like FID that capture distribution coverage. Solutions: minibatch discrimination, unrolled GANs, or use alternative generative approaches.
Diffusion models have emerged as the new state-of-the-art for image generation, powering systems like DALL-E 2, Stable Diffusion, and Imagen.
Core Idea: gradually corrupt data by adding small amounts of noise over many steps (the forward process), then train a model to undo the corruption one step at a time (the reverse process). Generation runs the learned reverse process starting from pure noise.
Mathematically:
Forward (fixed): $$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$
After many steps, $x_T \approx \mathcal{N}(0, I)$.
Reverse (learned): $$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$
Train a neural network to predict the noise added at each step.
Training Objective:
Simplified objective (denoising score matching): $$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]$$
where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t}(1 - \beta_s)$, and $\epsilon_\theta$ predicts the noise $\epsilon$ that was added to $x_0$ to produce $x_t$.
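A minimal sketch of one training step under this simplified objective, on toy 2-D data with a small MLP standing in for the usual U-Net denoiser (both the data and the `eps_model` architecture are illustrative assumptions): sample a timestep, form $x_t$ from $x_0$ in closed form, and regress the injected noise.

```python
# One DDPM-style training step with the simplified noise-prediction loss.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                   # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)           # \bar{alpha}_t

# Toy denoiser: takes [x_t, t/T] and predicts the noise (stand-in for a U-Net).
eps_model = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(), nn.Linear(128, 2))

def loss_step(x0):                                      # x0: (batch, 2) toy data
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps        # closed-form forward noising
    t_feat = (t.float() / T).unsqueeze(-1)              # crude timestep embedding
    eps_pred = eps_model(torch.cat([x_t, t_feat], dim=-1))
    return ((eps - eps_pred) ** 2).mean()               # simplified DDPM loss

loss = loss_step(torch.randn(64, 2))
loss.backward()
```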
Generation: start from pure noise $x_T \sim \mathcal{N}(0, I)$ and repeatedly apply the learned reverse transitions $p_\theta(x_{t-1} | x_t)$ until reaching $x_0$, which is the generated sample (a sketch follows below).
Key insight: Each denoising step is a small, well-conditioned problem. The model isn't asked to generate everything at once—it progressively refines.
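A matching sketch of ancestral, DDPM-style sampling, assuming a denoiser and noise schedule like those in the training sketch above (the variance choice $\sigma_t^2 = \beta_t$ is one common convention): start from Gaussian noise and apply the learned reverse steps, adding fresh noise at every step except the last.

```python
# Ancestral sampling: iteratively denoise from x_T ~ N(0, I) down to x_0.
import torch

@torch.no_grad()
def sample(eps_model, betas, n=64, dim=2):
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(n, dim)                                # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_feat = torch.full((n, 1), t / T)                 # same crude embedding as training
        eps = eps_model(torch.cat([x, t_feat], dim=-1))
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # sigma_t^2 = beta_t choice
    return x
```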
Accelerating Diffusion: naive sampling can require hundreds of steps, so practical systems use faster samplers such as DDIM, distillation into few-step models, and latent diffusion, which runs the process in a compressed latent space (the approach behind Stable Diffusion).
Guided Generation: conditioning signals such as class labels or text prompts steer sampling. Classifier guidance uses gradients from a separate classifier, while classifier-free guidance trains the denoiser both with and without conditioning and extrapolates between the two predictions at sampling time (see the sketch below).
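A minimal sketch of classifier-free guidance at sampling time, assuming a hypothetical denoiser `eps_model(x_t, t, cond)` that accepts `cond=None` for the unconditional branch (the exact guidance-scale convention varies across implementations):

```python
# Classifier-free guidance: blend conditional and unconditional noise predictions.
import torch

def guided_eps(eps_model, x_t, t, cond, w=5.0):
    eps_cond = eps_model(x_t, t, cond)        # conditioned on e.g. a text prompt
    eps_uncond = eps_model(x_t, t, None)      # unconditional prediction
    # w = 1 recovers the plain conditional model; larger w strengthens guidance.
    return eps_uncond + w * (eps_cond - eps_uncond)
```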
Diffusion models succeed because: (1) they break the hard problem of generation into many easy denoising problems, (2) training is stable (just MSE on noise prediction), (3) they naturally cover all modes (no collapse), and (4) they scale well with compute. The trade-off is speed—expect continued innovation in fast sampling.
Evaluating generative models is notoriously difficult. Unlike classification (accuracy) or regression (MSE), there's no single number that captures 'quality of generation.'
What Should We Measure?
| Metric | What It Measures | Limitations |
|---|---|---|
| Log-Likelihood | How probable is test data under model | Not always available (GANs); can be high despite bad samples |
| Inception Score (IS) | Quality and diversity via classifier | Insensitive to intra-class diversity; ignores training data |
| Fréchet Inception Distance (FID) | Distance between real/generated feature distributions | Requires many samples; assumes Gaussian; depends on classifier |
| Precision & Recall | Fidelity (precision) and coverage (recall), reported separately | Requires estimating real and generated supports in feature space; sensitive to sample size |
| LPIPS | Perceptual similarity | For reconstruction; measures diversity indirectly |
| Human Evaluation | Ultimate test of quality | Expensive, slow, subjective |
| Downstream Task Performance | Utility for applications | Task-specific; doesn't measure generation quality directly |
Fréchet Inception Distance (FID):
The most widely used metric for image generation: embed real and generated images with an Inception network, fit a Gaussian $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ to each set of features, and compute the Fréchet distance between the two Gaussians:
$$\text{FID} = \|\mu_r - \mu_g\|^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$
Lower FID = Better. FID of 0 means identical distributions.
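A minimal sketch of the FID formula itself, assuming the Inception features have already been extracted elsewhere for the real and generated images (arrays of shape `(n_samples, 2048)`); feature extraction is not shown.

```python
# FID from precomputed feature arrays: Frechet distance between two Gaussians.
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root
    covmean = covmean.real                                 # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```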
Limitations: FID needs many samples for a stable estimate, assumes the Inception features are well described by a Gaussian, and depends on the particular feature extractor, so scores are only comparable when computed with the same network and preprocessing.
When a measure becomes a target, it ceases to be a good measure. Models can be optimized specifically to reduce FID while missing other important aspects of generation quality. Always use multiple metrics, include human evaluation for important applications, and remember that metrics are proxies, not ground truth.
Generative models power an expanding range of applications, from creative tools to scientific discovery.
Content Creation: text-to-image generation (e.g., Stable Diffusion, DALL-E 2), text and code generation with GPT-style models, and synthesis of audio, music, and video.
Conditional Generation:
Often we want to generate samples with specific properties: class-conditional synthesis, text-conditioned generation from prompts, image-to-image translation, and inpainting or editing under constraints. Formally, this means modeling a conditional distribution $p(x \mid c)$ for some conditioning signal $c$.
Conditional generation transforms generative models from random samplers to controllable creative tools.
Generative models raise significant ethical concerns: deepfakes for misinformation, copyright issues with training data, job displacement in creative industries, bias amplification from training data. Responsible deployment requires: watermarking generated content, consent for training data, bias auditing, and careful consideration of societal impact.
We've explored generative modeling comprehensively—from discriminative/generative distinctions through density estimation, latent variable models, GANs, diffusion models, and evaluation. Let's synthesize the key insights.
Connection to Other Problem Types:
Generative modeling integrates with the full ML landscape: generative classifiers link it to classification, latent variable models connect to clustering and dimensionality reduction, autoregressive generation is structured prediction over sequences, and generated or augmented data feeds back into supervised and semi-supervised learning.
Completing the ML Landscape:
We've now surveyed the five major ML problem types: regression (continuous), classification (discrete), clustering (unsupervised grouping), structured prediction (complex outputs), and generative modeling (data synthesis). Together, these form the conceptual foundation for understanding any machine learning task.
You now possess a comprehensive understanding of the machine learning landscape. From regression through generative modeling, you can recognize, formulate, and approach any ML problem type with clarity and precision. This foundational knowledge prepares you for deep dives into specific algorithms, architectures, and applications throughout your ML journey.