The connection between Markov Random Fields and energy functions runs deep. In statistical physics, systems naturally evolve toward low-energy states. The Boltzmann distribution describes this phenomenon: configurations with lower energy are exponentially more probable.
This same mathematical framework underlies MRFs, where $\psi(\mathbf{x}) = \exp(-E(\mathbf{x}))$. Energy-Based Models (EBMs) embrace this perspective, defining probability distributions through energy functions rather than explicit potentials. This viewpoint has proven remarkably fruitful, connecting classical probabilistic modeling to modern deep learning approaches like contrastive learning and score-based generative models.
By the end of this page, you will understand the energy-based formulation of probability, master the Gibbs/Boltzmann distribution, explore classic EBMs like Hopfield networks and Boltzmann machines, and connect to modern deep EBMs and contrastive learning.
The Gibbs distribution (also called Boltzmann distribution) is the foundation of energy-based modeling:
$$P(\mathbf{x}) = \frac{1}{Z} \exp\left(-\frac{E(\mathbf{x})}{T}\right)$$
where $E(\mathbf{x})$ is the energy of configuration $\mathbf{x}$, $T > 0$ is the temperature, and $Z = \sum_{\mathbf{x}} \exp(-E(\mathbf{x})/T)$ is the partition function that normalizes the distribution.
Every MRF can be written as an EBM and vice versa:
• Potential → Energy: $E(\mathbf{x}) = -\sum_C \log \psi_C(\mathbf{x}_C)$
• Energy → Potential: $\psi(\mathbf{x}) = \exp(-E(\mathbf{x}))$
The energy view emphasizes minimization (lower is better), while the potential view emphasizes maximization (higher is better).
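To make the conversion concrete, here is a minimal sketch (with a made-up pairwise potential on two binary variables) checking that the MRF and EBM views assign the same unnormalized score to every configuration:

```python
import numpy as np

# Hypothetical pairwise potential on two binary variables (values 0/1)
psi = np.array([[2.0, 0.5],
                [0.5, 2.0]])   # psi[x1, x2]: agreement is favored

# Potential -> energy: E = -log(psi)
E = -np.log(psi)

for x1 in (0, 1):
    for x2 in (0, 1):
        from_potential = psi[x1, x2]        # MRF view
        from_energy = np.exp(-E[x1, x2])    # EBM view
        assert np.isclose(from_potential, from_energy)
        print(f"x=({x1},{x2}): psi={from_potential:.2f}, exp(-E)={from_energy:.2f}")
```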
Temperature and Its Effects:
Temperature provides a "knob" that interpolates between optimization and random sampling: as $T \to 0$ the distribution concentrates on the minimum-energy configurations, while as $T \to \infty$ it approaches the uniform distribution. Simulated annealing exploits this by gradually lowering $T$.
```python
import numpy as np
from typing import Callable, List


class EnergyBasedModel:
    """
    General Energy-Based Model with Gibbs distribution.
    """

    def __init__(self, energy_func: Callable, state_space: List):
        self.energy = energy_func
        self.states = state_space
        self.energies = {s: energy_func(s) for s in state_space}

    def compute_Z(self, temperature: float = 1.0) -> float:
        """Compute partition function at given temperature."""
        return sum(np.exp(-E / temperature) for E in self.energies.values())

    def probability(self, state, temperature: float = 1.0) -> float:
        """Compute probability of a state."""
        Z = self.compute_Z(temperature)
        return np.exp(-self.energy(state) / temperature) / Z

    def distribution(self, temperature: float = 1.0) -> dict:
        """Return full probability distribution."""
        Z = self.compute_Z(temperature)
        return {s: np.exp(-E / temperature) / Z for s, E in self.energies.items()}

    def sample_gibbs(self, n_samples: int, temperature: float = 1.0):
        """Sample from the distribution."""
        probs = self.distribution(temperature)
        states = list(probs.keys())
        weights = [probs[s] for s in states]
        indices = np.random.choice(len(states), size=n_samples, p=weights)
        return [states[i] for i in indices]


# Example: Simple energy landscape
def quadratic_energy(x):
    """E(x) = (x - 2)^2, minimum at x=2."""
    return (x - 2) ** 2


states = list(range(-5, 10))
model = EnergyBasedModel(quadratic_energy, states)

print("Temperature Effects on Distribution:")
print("=" * 50)

for T in [0.5, 1.0, 2.0, 5.0]:
    dist = model.distribution(T)
    mode = max(dist, key=dist.get)
    entropy = -sum(p * np.log(p + 1e-10) for p in dist.values())
    print(f"T={T}: mode={mode}, entropy={entropy:.2f}")
```

The EBM framework encompasses many classical models with rich histories in physics, neuroscience, and machine learning.
The Ising Model:
The simplest non-trivial MRF, from statistical physics:
$$E(\mathbf{s}) = -\sum_{(i,j) \in \mathcal{E}} J_{ij} s_i s_j - \sum_i h_i s_i$$
where $s_i \in \{-1, +1\}$ are spins, $J_{ij}$ are coupling strengths, $h_i$ are external fields, and $\mathcal{E}$ is the set of edges (interacting pairs).
Hopfield Networks:
A neural model for associative memory:
$$E(\mathbf{s}) = -\frac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j$$
where weights encode stored patterns. The network dynamics minimize energy, settling into stored memories as local minima.
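As an illustration, the sketch below stores two patterns with the standard Hebbian outer-product rule (a common choice, not specified above) and recovers one of them from a corrupted cue via asynchronous, energy-descending updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two patterns to store, entries in {-1, +1}
patterns = np.array([
    [ 1,  1,  1, -1, -1, -1],
    [-1,  1, -1,  1, -1,  1],
])
N = patterns.shape[1]

# Hebbian storage: w_ij = (1/N) * sum_p s_i^p s_j^p, with zero diagonal
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)

def energy(s):
    """Hopfield energy E(s) = -1/2 s^T W s."""
    return -0.5 * s @ W @ s

# Corrupt the first stored pattern by flipping one unit
state = patterns[0].copy()
state[0] *= -1
print("corrupted:", state, " E =", round(energy(state), 3))

# Asynchronous updates; each flip never increases the energy
for _ in range(3):
    for i in rng.permutation(N):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("recovered:", state, " E =", round(energy(state), 3))
print("matches stored pattern 0:", bool(np.array_equal(state, patterns[0])))
```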
Boltzmann Machines:
A stochastic neural network with visible (v) and hidden (h) units:
$$E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} w_{ij} v_i h_j$$
Boltzmann machines can learn complex distributions over visible units by marginalizing over hidden units. Restricted Boltzmann Machines (RBMs) limit connections for tractability.
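The sketch below, with made-up parameters, evaluates the Boltzmann machine energy above and shows the practical payoff of the RBM restriction: without visible-visible or hidden-hidden connections, $P(\mathbf{h} \mid \mathbf{v})$ factorizes into one independent sigmoid per hidden unit.

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """E(v, h) = -a·v - b·h - v^T W h (the energy defined above)."""
    return -a @ v - b @ h - v @ W @ h

def hidden_conditionals(v, b, W):
    """P(h_j = 1 | v) = sigmoid(b_j + v·W[:, j]); this factorizes because
    hidden units interact only with visible units, never with each other."""
    return 1.0 / (1.0 + np.exp(-(b + v @ W)))

# Made-up parameters: 4 visible units, 2 hidden units
rng = np.random.default_rng(1)
a = rng.normal(size=4)          # visible biases
b = rng.normal(size=2)          # hidden biases
W = rng.normal(size=(4, 2))     # visible-hidden weights

v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([1.0, 0.0])

print("E(v, h)      =", rbm_energy(v, h, a, b, W))
print("P(h_j=1 | v) =", hidden_conditionals(v, b, W))
```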
```python
import numpy as np


class IsingModel:
    """2D Ising model on a grid."""

    def __init__(self, size: int, coupling: float = 1.0, field: float = 0.0):
        self.size = size
        self.J = coupling
        self.h = field
        self.spins = np.random.choice([-1, 1], size=(size, size))

    def energy(self) -> float:
        """Compute total energy."""
        E = 0.0
        for i in range(self.size):
            for j in range(self.size):
                # Interaction with right and down neighbors
                if j < self.size - 1:
                    E -= self.J * self.spins[i, j] * self.spins[i, j+1]
                if i < self.size - 1:
                    E -= self.J * self.spins[i, j] * self.spins[i+1, j]
                E -= self.h * self.spins[i, j]
        return E

    def magnetization(self) -> float:
        """Average spin (order parameter)."""
        return np.mean(self.spins)

    def metropolis_step(self, temperature: float):
        """One Metropolis-Hastings update step."""
        i, j = np.random.randint(0, self.size, 2)
        # Energy change if we flip spin at (i,j)
        neighbors_sum = 0
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ni, nj = i + di, j + dj
            if 0 <= ni < self.size and 0 <= nj < self.size:
                neighbors_sum += self.spins[ni, nj]
        delta_E = 2 * self.spins[i, j] * (self.J * neighbors_sum + self.h)
        # Accept with Metropolis probability
        if delta_E < 0 or np.random.random() < np.exp(-delta_E / temperature):
            self.spins[i, j] *= -1

    def simulate(self, temperature: float, n_steps: int):
        """Run simulation and track observables."""
        energies, mags = [], []
        for step in range(n_steps):
            for _ in range(self.size ** 2):  # Sweep
                self.metropolis_step(temperature)
            if step % 100 == 0:
                energies.append(self.energy())
                mags.append(abs(self.magnetization()))
        return energies, mags


# Demonstrate Ising phase transition
print("Ising Model Simulation:")
print("=" * 50)

model = IsingModel(size=10, coupling=1.0)

for T in [1.0, 2.27, 4.0]:  # Below, near, above critical temperature
    model.spins = np.random.choice([-1, 1], size=(10, 10))
    energies, mags = model.simulate(T, 500)
    print(f"T={T:.2f}: final |M|={mags[-1]:.3f}, E={energies[-1]:.1f}")
```

Training an EBM means adjusting its energy function so that low energy is assigned to observed data and high energy to other configurations.
For log-linear EBMs with $E(\mathbf{x}; \theta) = -\theta \cdot f(\mathbf{x})$:
$$\nabla_\theta \log P(\mathbf{x}_{\text{data}}) = f(\mathbf{x}_{\text{data}}) - \mathbb{E}_{P_\theta}[f(\mathbf{X})]$$
The gradient is the difference between data features and model expectations. This is the famous positive phase minus negative phase update.
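For a state space small enough to enumerate, both phases can be computed exactly. The sketch below uses a hypothetical two-feature log-linear model and made-up data to show the positive phase (data feature average) minus the negative phase (exact model expectation):

```python
import numpy as np

# Hypothetical log-linear EBM over states x in {0, 1, 2, 3}:
# E(x; theta) = -theta · f(x), so P(x) ∝ exp(theta · f(x))
def features(x):
    return np.array([x, x * x], dtype=float)   # two made-up features

theta = np.array([0.5, -0.2])
states = np.arange(4)

# Model distribution by brute-force enumeration (fine for tiny state spaces)
scores = np.array([theta @ features(x) for x in states])
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Observed data (assumed for illustration)
data = np.array([2, 2, 3, 1, 2])

positive_phase = np.mean([features(x) for x in data], axis=0)         # E_data[f(X)]
negative_phase = sum(p * features(x) for p, x in zip(probs, states))  # E_model[f(X)]

grad = positive_phase - negative_phase   # gradient of the average log-likelihood
print("positive phase:", positive_phase)
print("negative phase:", negative_phase)
print("gradient      :", grad)
```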
The Contrastive Divergence Algorithm:
Computing the model expectation exactly requires sampling from the model, which is expensive. Contrastive Divergence (CD) approximates it: start a Gibbs chain at a data point, run it for only $k$ steps, and use the resulting sample in place of a true model sample in the negative phase.
CD-1 (one step) is fast but biased; running more steps reduces the bias at greater computational cost.
```python
import numpy as np


class ContrastiveDivergence:
    """
    Contrastive Divergence learning for Restricted Boltzmann Machines.
    """

    def __init__(self, n_visible: int, n_hidden: int):
        self.n_v = n_visible
        self.n_h = n_hidden
        # Initialize weights
        self.W = np.random.randn(n_visible, n_hidden) * 0.01
        self.a = np.zeros(n_visible)  # Visible biases
        self.b = np.zeros(n_hidden)   # Hidden biases

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def sample_hidden(self, v: np.ndarray):
        """Sample h ~ P(h|v); returns (binary sample, activation probabilities)."""
        prob_h = self.sigmoid(v @ self.W + self.b)
        return (np.random.random(self.n_h) < prob_h).astype(float), prob_h

    def sample_visible(self, h: np.ndarray):
        """Sample v ~ P(v|h); returns (binary sample, activation probabilities)."""
        prob_v = self.sigmoid(h @ self.W.T + self.a)
        return (np.random.random(self.n_v) < prob_v).astype(float), prob_v

    def cd_step(self, v_data: np.ndarray, k: int = 1, lr: float = 0.01):
        """One step of CD-k learning."""
        # Positive phase: data -> hidden
        h_data, prob_h_data = self.sample_hidden(v_data)
        positive_grad = np.outer(v_data, prob_h_data)

        # Negative phase: k steps of Gibbs sampling
        v_sample = v_data.copy()
        for _ in range(k):
            h_sample, _ = self.sample_hidden(v_sample)
            v_sample, prob_v_sample = self.sample_visible(h_sample)
        _, prob_h_model = self.sample_hidden(v_sample)
        negative_grad = np.outer(v_sample, prob_h_model)

        # Update weights
        self.W += lr * (positive_grad - negative_grad)
        self.a += lr * (v_data - v_sample)
        self.b += lr * (prob_h_data - prob_h_model)

    def energy(self, v: np.ndarray) -> float:
        """Compute free energy F(v) = -log Σ_h exp(-E(v,h))."""
        hidden_term = np.sum(np.log(1 + np.exp(v @ self.W + self.b)))
        return -np.dot(v, self.a) - hidden_term


# Demo
print("\nContrastive Divergence Training:")
print("=" * 50)

rbm = ContrastiveDivergence(n_visible=6, n_hidden=3)

# Synthetic data: two patterns
patterns = [
    np.array([1, 1, 1, 0, 0, 0]),
    np.array([0, 0, 0, 1, 1, 1]),
]

for epoch in range(100):
    for pattern in patterns:
        rbm.cd_step(pattern, k=1, lr=0.1)

print("Learned energies:")
for p in patterns:
    print(f"  Pattern {p}: E = {rbm.energy(p):.2f}")
print(f"  Random  {np.array([1,0,1,0,1,0])}: E = {rbm.energy(np.array([1,0,1,0,1,0])):.2f}")
```

Recent years have seen a renaissance of EBMs, now with deep neural networks parameterizing the energy function.
The score $\nabla_\mathbf{x} \log p(\mathbf{x}) = -\nabla_\mathbf{x} E(\mathbf{x})$ doesn't depend on $Z$! Score matching and denoising score matching train EBMs without computing the partition function, enabling modern deep generative models.
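A quick numerical check of that claim, using a made-up quadratic energy: the analytic score $-\nabla_\mathbf{x} E(\mathbf{x})$ matches the derivative of $\log p(\mathbf{x})$ computed the expensive way, because the $-\log Z$ term is a constant that vanishes under differentiation.

```python
import numpy as np

# Made-up energy E(x) = (x - 2)^2 / 2, so p(x) ∝ exp(-E(x)) is Gaussian with mean 2
def energy(x):
    return 0.5 * (x - 2.0) ** 2

def score_from_energy(x):
    """Score = -dE/dx; the partition function never appears."""
    return -(x - 2.0)

# The "expensive" route: normalize numerically, then differentiate log p
xs = np.linspace(-8.0, 12.0, 20001)
unnorm = np.exp(-energy(xs))
Z = np.sum(unnorm) * (xs[1] - xs[0])      # numerical partition function
log_p = -energy(xs) - np.log(Z)
score_numeric = np.gradient(log_p, xs)    # d/dx log p(x)

for x0 in [0.0, 1.0, 3.5]:
    i = int(np.argmin(np.abs(xs - x0)))
    print(f"x={x0}: -dE/dx = {score_from_energy(x0):+.3f}, "
          f"d/dx log p = {score_numeric[i]:+.3f}")
```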
Noise Contrastive Estimation:
NCE sidesteps partition-function computation by converting density estimation into classification: a logistic classifier is trained to distinguish data samples from samples drawn from a known noise distribution, with the unknown log partition function treated as just another learnable parameter.
This principle underlies many modern contrastive learning methods.
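A minimal sketch of the idea under assumed toy settings (1D Gaussian data, Gaussian noise, and a single learnable mean, all invented for illustration): the model's unknown log-normalizer $c$ is treated as just another parameter, and everything is fit by gradient ascent on the NCE logistic objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from a Gaussian unknown to the model; noise from a known Gaussian
x_data = rng.normal(loc=1.5, scale=1.0, size=2000)
x_noise = rng.normal(loc=0.0, scale=2.0, size=2000)

def log_noise(x):
    """Known noise log-density: N(0, 2^2)."""
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Unnormalized model: log p_theta(x) = -0.5 * (x - mu)^2 + c,
# where c stands in for the unknown -log Z and is learned like any parameter
mu, c = 0.0, 0.0
lr = 0.1

for step in range(2000):
    g_data = (-0.5 * (x_data - mu) ** 2 + c) - log_noise(x_data)
    g_noise = (-0.5 * (x_noise - mu) ** 2 + c) - log_noise(x_noise)
    # Gradient of the NCE objective J = E_data[log σ(G)] + E_noise[log σ(-G)]
    w_data = 1.0 - sigmoid(g_data)      # weight on data-side gradients
    w_noise = -sigmoid(g_noise)         # weight on noise-side gradients
    grad_mu = np.mean(w_data * (x_data - mu)) + np.mean(w_noise * (x_noise - mu))
    grad_c = np.mean(w_data) + np.mean(w_noise)
    mu += lr * grad_mu
    c += lr * grad_c

print(f"estimated mean mu ≈ {mu:.3f} (true 1.5)")
print(f"estimated log-normalizer c ≈ {c:.3f} (true {-0.5 * np.log(2 * np.pi):.3f})")
```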
| Domain | Application | Energy Function Design |
|---|---|---|
| Computer Vision | Image denoising, segmentation | Pairwise smoothness + data fidelity |
| Natural Language | Language modeling, parsing | Feature-based log-linear models |
| Computational Biology | Protein structure prediction | Physical force fields + statistical potentials |
| Robotics | Imitation learning | Energy landscapes over trajectories |
| Recommendation | Preference modeling | User-item interaction energies |
The EBM framework unifies diverse problems under a common mathematical umbrella: define an energy function capturing what makes configurations good or bad, and let the Gibbs distribution handle the rest.
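As one concrete instance of that recipe, the sketch below (on a made-up 1D signal) builds the "pairwise smoothness + data fidelity" energy from the vision row of the table and reduces it by coordinate-wise minimization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1D "image": a step signal plus noise
clean = np.concatenate([np.zeros(20), np.ones(20)])
noisy = clean + rng.normal(scale=0.3, size=clean.size)

def energy(x, y, lam=2.0):
    """Data fidelity (stay near the observation y) + pairwise smoothness."""
    fidelity = np.sum((x - y) ** 2)
    smoothness = np.sum((x[1:] - x[:-1]) ** 2)
    return fidelity + lam * smoothness

# Coordinate descent: each x_i has a closed-form minimizer given its neighbors
x = noisy.copy()
lam = 2.0
for sweep in range(100):
    for i in range(x.size):
        neighbors = []
        if i > 0:
            neighbors.append(x[i - 1])
        if i < x.size - 1:
            neighbors.append(x[i + 1])
        # Minimize (x_i - y_i)^2 + lam * sum_n (x_i - n)^2 over x_i
        x[i] = (noisy[i] + lam * sum(neighbors)) / (1 + lam * len(neighbors))

print(f"energy before: {energy(noisy, noisy):.2f}")   # fidelity term is zero here
print(f"energy after : {energy(x, noisy):.2f}")
print(f"MSE to clean before: {np.mean((noisy - clean) ** 2):.4f}")
print(f"MSE to clean after : {np.mean((x - clean) ** 2):.4f}")
```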
Congratulations! You've completed the Markov Random Fields module. You now understand undirected graphs, potential functions, the partition function, the Hammersley-Clifford theorem, and energy-based models—the complete theoretical foundation for working with MRFs.