The connection between Markov Random Fields and energy functions runs deep. In statistical physics, systems naturally evolve toward low-energy states. The Boltzmann distribution describes this phenomenon: configurations with lower energy are exponentially more probable.
This same mathematical framework underlies MRFs, where $\psi(\mathbf{x}) = \exp(-E(\mathbf{x}))$. Energy-Based Models (EBMs) embrace this perspective, defining probability distributions through energy functions rather than explicit potentials. This viewpoint has proven remarkably fruitful, connecting classical probabilistic modeling to modern deep learning approaches like contrastive learning and score-based generative models.
By the end of this page, you will understand the energy-based formulation of probability, master the Gibbs/Boltzmann distribution, explore classic EBMs like Hopfield networks and Boltzmann machines, and connect to modern deep EBMs and contrastive learning.
The Gibbs distribution (also called Boltzmann distribution) is the foundation of energy-based modeling:
$$P(\mathbf{x}) = \frac{1}{Z} \exp\left(-\frac{E(\mathbf{x})}{T}\right)$$
where $E(\mathbf{x})$ is the energy of configuration $\mathbf{x}$, $T > 0$ is the temperature, and $Z = \sum_{\mathbf{x}} \exp(-E(\mathbf{x})/T)$ is the partition function that normalizes the distribution.
Every MRF can be written as an EBM and vice versa:
• Potential → Energy: $E(\mathbf{x}) = -\sum_C \log \psi_C(\mathbf{x}_C)$
• Energy → Potential: $\psi(\mathbf{x}) = \exp(-E(\mathbf{x}))$
The energy view emphasizes minimization (lower is better), while the potential view emphasizes maximization (higher is better).
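To make the conversion concrete, here is a minimal sketch (with a made-up pairwise potential on two binary variables) checking that the MRF and EBM views assign the same unnormalized score to every configuration:

```python
import numpy as np

# Hypothetical pairwise potential on two binary variables (values 0/1)
psi = np.array([[2.0, 0.5],
                [0.5, 2.0]])   # psi[x1, x2]: agreement is favored

# Potential -> energy: E = -log(psi)
E = -np.log(psi)

for x1 in (0, 1):
    for x2 in (0, 1):
        from_potential = psi[x1, x2]        # MRF view
        from_energy = np.exp(-E[x1, x2])    # EBM view
        assert np.isclose(from_potential, from_energy)
        print(f"x=({x1},{x2}): psi={from_potential:.2f}, exp(-E)={from_energy:.2f}")
```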
Temperature and Its Effects:
Temperature provides a "knob" that interpolates between optimization and random sampling: as $T \to 0$ the distribution concentrates on the minimum-energy configurations, while as $T \to \infty$ it approaches the uniform distribution. Simulated annealing exploits this by gradually lowering $T$.
```python
import numpy as np
from typing import Callable, List


class EnergyBasedModel:
    """
    General Energy-Based Model with Gibbs distribution.
    """

    def __init__(self, energy_func: Callable, state_space: List):
        self.energy = energy_func
        self.states = state_space
        self.energies = {s: energy_func(s) for s in state_space}

    def compute_Z(self, temperature: float = 1.0) -> float:
        """Compute partition function at given temperature."""
        return sum(np.exp(-E / temperature) for E in self.energies.values())

    def probability(self, state, temperature: float = 1.0) -> float:
        """Compute probability of a state."""
        Z = self.compute_Z(temperature)
        return np.exp(-self.energy(state) / temperature) / Z

    def distribution(self, temperature: float = 1.0) -> dict:
        """Return full probability distribution."""
        Z = self.compute_Z(temperature)
        return {s: np.exp(-E / temperature) / Z for s, E in self.energies.items()}

    def sample_gibbs(self, n_samples: int, temperature: float = 1.0):
        """Sample from the distribution."""
        probs = self.distribution(temperature)
        states = list(probs.keys())
        weights = [probs[s] for s in states]
        indices = np.random.choice(len(states), size=n_samples, p=weights)
        return [states[i] for i in indices]


# Example: Simple energy landscape
def quadratic_energy(x):
    """E(x) = (x - 2)^2, minimum at x=2."""
    return (x - 2) ** 2


states = list(range(-5, 10))
model = EnergyBasedModel(quadratic_energy, states)

print("Temperature Effects on Distribution:")
print("=" * 50)

for T in [0.5, 1.0, 2.0, 5.0]:
    dist = model.distribution(T)
    mode = max(dist, key=dist.get)
    entropy = -sum(p * np.log(p + 1e-10) for p in dist.values())
    print(f"T={T}: mode={mode}, entropy={entropy:.2f}")
```

The EBM framework encompasses many classical models with rich histories in physics, neuroscience, and machine learning.
The Ising Model:
The simplest non-trivial MRF, from statistical physics:
$$E(\mathbf{s}) = -\sum_{(i,j) \in \mathcal{E}} J_{ij} s_i s_j - \sum_i h_i s_i$$
where $s_i \in \{-1, +1\}$ are spins, $J_{ij}$ are coupling strengths, $h_i$ are external fields, and $\mathcal{E}$ is the set of edges (interacting pairs).
Hopfield Networks:
A neural model for associative memory:
$$E(\mathbf{s}) = -\frac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j$$
where weights encode stored patterns. The network dynamics minimize energy, settling into stored memories as local minima.
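As an illustration, the sketch below stores two patterns with the standard Hebbian outer-product rule (a common choice, not specified above) and recovers one of them from a corrupted cue via asynchronous, energy-descending updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two patterns to store, entries in {-1, +1}
patterns = np.array([
    [ 1,  1,  1, -1, -1, -1],
    [-1,  1, -1,  1, -1,  1],
])
N = patterns.shape[1]

# Hebbian storage: w_ij = (1/N) * sum_p s_i^p s_j^p, with zero diagonal
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)

def energy(s):
    """Hopfield energy E(s) = -1/2 s^T W s."""
    return -0.5 * s @ W @ s

# Corrupt the first stored pattern by flipping one unit
state = patterns[0].copy()
state[0] *= -1
print("corrupted:", state, " E =", round(energy(state), 3))

# Asynchronous updates; each flip never increases the energy
for _ in range(3):
    for i in rng.permutation(N):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("recovered:", state, " E =", round(energy(state), 3))
print("matches stored pattern 0:", bool(np.array_equal(state, patterns[0])))
```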
Boltzmann Machines:
A stochastic neural network with visible (v) and hidden (h) units:
$$E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} w_{ij} v_i h_j$$
Boltzmann machines can learn complex distributions over visible units by marginalizing over hidden units. Restricted Boltzmann Machines (RBMs) limit connections for tractability.
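The sketch below, with made-up parameters, evaluates the Boltzmann machine energy above and shows the practical payoff of the RBM restriction: without visible-visible or hidden-hidden connections, $P(\mathbf{h} \mid \mathbf{v})$ factorizes into one independent sigmoid per hidden unit.

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """E(v, h) = -a·v - b·h - v^T W h (the energy defined above)."""
    return -a @ v - b @ h - v @ W @ h

def hidden_conditionals(v, b, W):
    """P(h_j = 1 | v) = sigmoid(b_j + v·W[:, j]); this factorizes because
    hidden units interact only with visible units, never with each other."""
    return 1.0 / (1.0 + np.exp(-(b + v @ W)))

# Made-up parameters: 4 visible units, 2 hidden units
rng = np.random.default_rng(1)
a = rng.normal(size=4)          # visible biases
b = rng.normal(size=2)          # hidden biases
W = rng.normal(size=(4, 2))     # visible-hidden weights

v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([1.0, 0.0])

print("E(v, h)      =", rbm_energy(v, h, a, b, W))
print("P(h_j=1 | v) =", hidden_conditionals(v, b, W))
```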
```python
import numpy as np


class IsingModel:
    """2D Ising model on a grid."""

    def __init__(self, size: int, coupling: float = 1.0, field: float = 0.0):
        self.size = size
        self.J = coupling
        self.h = field
        self.spins = np.random.choice([-1, 1], size=(size, size))

    def energy(self) -> float:
        """Compute total energy."""
        E = 0.0
        for i in range(self.size):
            for j in range(self.size):
                # Interaction with right and down neighbors
                if j < self.size - 1:
                    E -= self.J * self.spins[i, j] * self.spins[i, j+1]
                if i < self.size - 1:
                    E -= self.J * self.spins[i, j] * self.spins[i+1, j]
                E -= self.h * self.spins[i, j]
        return E

    def magnetization(self) -> float:
        """Average spin (order parameter)."""
        return np.mean(self.spins)

    def metropolis_step(self, temperature: float):
        """One Metropolis-Hastings update step."""
        i, j = np.random.randint(0, self.size, 2)
        # Energy change if we flip spin at (i,j)
        neighbors_sum = 0
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ni, nj = i + di, j + dj
            if 0 <= ni < self.size and 0 <= nj < self.size:
                neighbors_sum += self.spins[ni, nj]
        delta_E = 2 * self.spins[i, j] * (self.J * neighbors_sum + self.h)
        # Accept with Metropolis probability
        if delta_E < 0 or np.random.random() < np.exp(-delta_E / temperature):
            self.spins[i, j] *= -1

    def simulate(self, temperature: float, n_steps: int):
        """Run simulation and track observables."""
        energies, mags = [], []
        for step in range(n_steps):
            for _ in range(self.size ** 2):  # Sweep
                self.metropolis_step(temperature)
            if step % 100 == 0:
                energies.append(self.energy())
                mags.append(abs(self.magnetization()))
        return energies, mags


# Demonstrate Ising phase transition
print("Ising Model Simulation:")
print("=" * 50)

model = IsingModel(size=10, coupling=1.0)

for T in [1.0, 2.27, 4.0]:  # Below, near, above critical temperature
    model.spins = np.random.choice([-1, 1], size=(10, 10))
    energies, mags = model.simulate(T, 500)
    print(f"T={T:.2f}: final |M|={mags[-1]:.3f}, E={energies[-1]:.1f}")
```

Training an EBM means adjusting its energy function so that low energy is assigned to observed data and high energy to other configurations.
For log-linear EBMs with $E(\mathbf{x}; \theta) = -\theta \cdot f(\mathbf{x})$:
$$\nabla_\theta \log P(\mathbf{x}_{\text{data}}) = f(\mathbf{x}_{\text{data}}) - \mathbb{E}_{P_\theta}[f(\mathbf{X})]$$
The gradient is the difference between data features and model expectations. This is the famous positive phase minus negative phase update.
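For a state space small enough to enumerate, both phases can be computed exactly. The sketch below uses a hypothetical two-feature log-linear model and made-up data to show the positive phase (data feature average) minus the negative phase (exact model expectation):

```python
import numpy as np

# Hypothetical log-linear EBM over states x in {0, 1, 2, 3}:
# E(x; theta) = -theta · f(x), so P(x) ∝ exp(theta · f(x))
def features(x):
    return np.array([x, x * x], dtype=float)   # two made-up features

theta = np.array([0.5, -0.2])
states = np.arange(4)

# Model distribution by brute-force enumeration (fine for tiny state spaces)
scores = np.array([theta @ features(x) for x in states])
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Observed data (assumed for illustration)
data = np.array([2, 2, 3, 1, 2])

positive_phase = np.mean([features(x) for x in data], axis=0)         # E_data[f(X)]
negative_phase = sum(p * features(x) for p, x in zip(probs, states))  # E_model[f(X)]

grad = positive_phase - negative_phase   # gradient of the average log-likelihood
print("positive phase:", positive_phase)
print("negative phase:", negative_phase)
print("gradient      :", grad)
```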
The Contrastive Divergence Algorithm:
Computing the model expectation exactly requires sampling from the model, which is expensive. Contrastive Divergence (CD) approximates it: start a Gibbs chain at a data point, run it for only $k$ steps, and use the resulting sample in place of a true model sample in the negative phase.
CD-1 (one step) is fast but biased; running more steps reduces the bias at greater computational cost.
```python
import numpy as np


class ContrastiveDivergence:
    """
    Contrastive Divergence learning for Restricted Boltzmann Machines.
    """

    def __init__(self, n_visible: int, n_hidden: int):
        self.n_v = n_visible
        self.n_h = n_hidden
        # Initialize weights
        self.W = np.random.randn(n_visible, n_hidden) * 0.01
        self.a = np.zeros(n_visible)  # Visible biases
        self.b = np.zeros(n_hidden)   # Hidden biases

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def sample_hidden(self, v: np.ndarray):
        """Sample h ~ P(h|v); returns (binary sample, activation probabilities)."""
        prob_h = self.sigmoid(v @ self.W + self.b)
        return (np.random.random(self.n_h) < prob_h).astype(float), prob_h

    def sample_visible(self, h: np.ndarray):
        """Sample v ~ P(v|h); returns (binary sample, activation probabilities)."""
        prob_v = self.sigmoid(h @ self.W.T + self.a)
        return (np.random.random(self.n_v) < prob_v).astype(float), prob_v

    def cd_step(self, v_data: np.ndarray, k: int = 1, lr: float = 0.01):
        """One step of CD-k learning."""
        # Positive phase: data -> hidden
        h_data, prob_h_data = self.sample_hidden(v_data)
        positive_grad = np.outer(v_data, prob_h_data)

        # Negative phase: k steps of Gibbs sampling
        v_sample = v_data.copy()
        for _ in range(k):
            h_sample, _ = self.sample_hidden(v_sample)
            v_sample, prob_v_sample = self.sample_visible(h_sample)
        _, prob_h_model = self.sample_hidden(v_sample)
        negative_grad = np.outer(v_sample, prob_h_model)

        # Update weights
        self.W += lr * (positive_grad - negative_grad)
        self.a += lr * (v_data - v_sample)
        self.b += lr * (prob_h_data - prob_h_model)

    def energy(self, v: np.ndarray) -> float:
        """Compute free energy F(v) = -log Σ_h exp(-E(v,h))."""
        hidden_term = np.sum(np.log(1 + np.exp(v @ self.W + self.b)))
        return -np.dot(v, self.a) - hidden_term


# Demo
print("\nContrastive Divergence Training:")
print("=" * 50)

rbm = ContrastiveDivergence(n_visible=6, n_hidden=3)

# Synthetic data: two patterns
patterns = [
    np.array([1, 1, 1, 0, 0, 0]),
    np.array([0, 0, 0, 1, 1, 1]),
]

for epoch in range(100):
    for pattern in patterns:
        rbm.cd_step(pattern, k=1, lr=0.1)

print("Learned energies:")
for p in patterns:
    print(f"  Pattern {p}: E = {rbm.energy(p):.2f}")
print(f"  Random  {np.array([1,0,1,0,1,0])}: E = {rbm.energy(np.array([1,0,1,0,1,0])):.2f}")
```

Recent years have seen a renaissance of EBMs, now with deep neural networks parameterizing the energy function.
The score $\nabla_\mathbf{x} \log p(\mathbf{x}) = -\nabla_\mathbf{x} E(\mathbf{x})$ doesn't depend on $Z$! Score matching and denoising score matching train EBMs without computing the partition function, enabling modern deep generative models.
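A quick numerical check of that claim, using a made-up quadratic energy: the analytic score $-\nabla_\mathbf{x} E(\mathbf{x})$ matches the derivative of $\log p(\mathbf{x})$ computed the expensive way, because the $-\log Z$ term is a constant that vanishes under differentiation.

```python
import numpy as np

# Made-up energy E(x) = (x - 2)^2 / 2, so p(x) ∝ exp(-E(x)) is Gaussian with mean 2
def energy(x):
    return 0.5 * (x - 2.0) ** 2

def score_from_energy(x):
    """Score = -dE/dx; the partition function never appears."""
    return -(x - 2.0)

# The "expensive" route: normalize numerically, then differentiate log p
xs = np.linspace(-8.0, 12.0, 20001)
unnorm = np.exp(-energy(xs))
Z = np.sum(unnorm) * (xs[1] - xs[0])      # numerical partition function
log_p = -energy(xs) - np.log(Z)
score_numeric = np.gradient(log_p, xs)    # d/dx log p(x)

for x0 in [0.0, 1.0, 3.5]:
    i = int(np.argmin(np.abs(xs - x0)))
    print(f"x={x0}: -dE/dx = {score_from_energy(x0):+.3f}, "
          f"d/dx log p = {score_numeric[i]:+.3f}")
```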
Noise Contrastive Estimation:
NCE sidesteps partition-function computation by converting density estimation into classification: a logistic classifier is trained to distinguish data samples from samples drawn from a known noise distribution, with the unknown log partition function treated as just another learnable parameter.
This principle underlies many modern contrastive learning methods.
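A minimal sketch of the idea under assumed toy settings (1D Gaussian data, Gaussian noise, and a single learnable mean, all invented for illustration): the model's unknown log-normalizer $c$ is treated as just another parameter, and everything is fit by gradient ascent on the NCE logistic objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from a Gaussian unknown to the model; noise from a known Gaussian
x_data = rng.normal(loc=1.5, scale=1.0, size=2000)
x_noise = rng.normal(loc=0.0, scale=2.0, size=2000)

def log_noise(x):
    """Known noise log-density: N(0, 2^2)."""
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Unnormalized model: log p_theta(x) = -0.5 * (x - mu)^2 + c,
# where c stands in for the unknown -log Z and is learned like any parameter
mu, c = 0.0, 0.0
lr = 0.1

for step in range(2000):
    g_data = (-0.5 * (x_data - mu) ** 2 + c) - log_noise(x_data)
    g_noise = (-0.5 * (x_noise - mu) ** 2 + c) - log_noise(x_noise)
    # Gradient of the NCE objective J = E_data[log σ(G)] + E_noise[log σ(-G)]
    w_data = 1.0 - sigmoid(g_data)      # weight on data-side gradients
    w_noise = -sigmoid(g_noise)         # weight on noise-side gradients
    grad_mu = np.mean(w_data * (x_data - mu)) + np.mean(w_noise * (x_noise - mu))
    grad_c = np.mean(w_data) + np.mean(w_noise)
    mu += lr * grad_mu
    c += lr * grad_c

print(f"estimated mean mu ≈ {mu:.3f} (true 1.5)")
print(f"estimated log-normalizer c ≈ {c:.3f} (true {-0.5 * np.log(2 * np.pi):.3f})")
```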
| Domain | Application | Energy Function Design |
|---|---|---|
| Computer Vision | Image denoising, segmentation | Pairwise smoothness + data fidelity |
| Natural Language | Language modeling, parsing | Feature-based log-linear models |
| Computational Biology | Protein structure prediction | Physical force fields + statistical potentials |
| Robotics | Imitation learning | Energy landscapes over trajectories |
| Recommendation | Preference modeling | User-item interaction energies |
The EBM framework unifies diverse problems under a common mathematical umbrella: define an energy function capturing what makes configurations good or bad, and let the Gibbs distribution handle the rest.
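As one concrete instance of that recipe, the sketch below (on a made-up 1D signal) builds the "pairwise smoothness + data fidelity" energy from the vision row of the table and reduces it by coordinate-wise minimization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1D "image": a step signal plus noise
clean = np.concatenate([np.zeros(20), np.ones(20)])
noisy = clean + rng.normal(scale=0.3, size=clean.size)

def energy(x, y, lam=2.0):
    """Data fidelity (stay near the observation y) + pairwise smoothness."""
    fidelity = np.sum((x - y) ** 2)
    smoothness = np.sum((x[1:] - x[:-1]) ** 2)
    return fidelity + lam * smoothness

# Coordinate descent: each x_i has a closed-form minimizer given its neighbors
x = noisy.copy()
lam = 2.0
for sweep in range(100):
    for i in range(x.size):
        neighbors = []
        if i > 0:
            neighbors.append(x[i - 1])
        if i < x.size - 1:
            neighbors.append(x[i + 1])
        # Minimize (x_i - y_i)^2 + lam * sum_n (x_i - n)^2 over x_i
        x[i] = (noisy[i] + lam * sum(neighbors)) / (1 + lam * len(neighbors))

print(f"energy before: {energy(noisy, noisy):.2f}")   # fidelity term is zero here
print(f"energy after : {energy(x, noisy):.2f}")
print(f"MSE to clean before: {np.mean((noisy - clean) ** 2):.4f}")
print(f"MSE to clean after : {np.mean((x - clean) ** 2):.4f}")
```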
Congratulations! You've completed the Markov Random Fields module. You now understand undirected graphs, potential functions, the partition function, the Hammersley-Clifford theorem, and energy-based models—the complete theoretical foundation for working with MRFs.