We've established that MRFs use potential functions defined over cliques to represent probability distributions. But a fundamental question remains: What is the precise relationship between graph structure and probability factorization?
The Hammersley-Clifford theorem answers this question definitively. It establishes a beautiful equivalence: a positive distribution satisfies the Markov properties with respect to a graph if and only if it factorizes as a product of potentials over the maximal cliques. This theorem is the theoretical cornerstone of undirected graphical models.
By the end of this page, you will understand the precise statement of the Hammersley-Clifford theorem, appreciate the positivity requirement and what happens without it, follow the key ideas of the proof, and understand the theorem's implications for MRF modeling and inference.
Let $P$ be a strictly positive probability distribution over random variables $\mathbf{X} = (X_1, \ldots, X_n)$, and let $G = (V, E)$ be an undirected graph with $V = \{1, \ldots, n\}$. Then the following are equivalent:

1. Markov property: $P$ satisfies the (global) Markov property with respect to $G$; that is, whenever a set $S$ separates $A$ from $B$ in $G$, we have $\mathbf{X}_A \perp\!\!\!\perp \mathbf{X}_B \mid \mathbf{X}_S$.
2. Factorization: $P$ factorizes over $G$ as $P(\mathbf{x}) = \frac{1}{Z}\prod_{C \in \mathcal{C}} \psi_C(\mathbf{x}_C)$, where $\mathcal{C}$ is the set of maximal cliques of $G$ and each $\psi_C$ is a positive potential function.
What This Means:
The theorem provides a two-way bridge:
Independence → Factorization: If we know a distribution has certain conditional independencies (expressible via graph separation), we can represent it as a product of clique potentials.
Factorization → Independence: If we define a distribution via clique potentials, we automatically get all the conditional independencies implied by graph separation.
This equivalence justifies the entire MRF framework: modeling with potentials is principled because it directly corresponds to conditional independence assumptions.
The Positivity Requirement:
The theorem requires $P(\mathbf{x}) > 0$ for all configurations $\mathbf{x}$. This "strictly positive" or "full support" assumption is crucial: the proof constructs potentials from $\log P$, which is only defined when every configuration has nonzero probability, and without positivity the equivalence itself can fail, as the counterexample later on this page shows.
The full proof of Hammersley-Clifford is technical, but the key ideas are accessible. Let's sketch the main arguments.
Direction 1: Factorization ⟹ Markov Properties
This direction is straightforward. If $P$ factorizes as: $$P(\mathbf{x}) = \frac{1}{Z}\prod_{C} \psi_C(\mathbf{x}_C)$$
Consider sets $A$ and $B$ separated by $S$ in the graph, and suppose for this sketch that every variable lies in $A \cup B \cup S$ (any remaining variables can be folded into $A$ or $B$). Because $S$ separates $A$ from $B$, no clique contains a variable from $A$ and a variable from $B$, so each clique lies entirely within $A \cup S$ or entirely within $B \cup S$. Grouping the factors accordingly, we can write: $$P(\mathbf{x}) \propto f(\mathbf{x}_{A \cup S}) \cdot g(\mathbf{x}_{B \cup S})$$
This factorization immediately implies $A \perp\!\!\!\perp B \mid S$.
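To spell out that last step, condition on $\mathbf{x}_S$ and normalize (the constant $1/Z$ cancels in the ratio):

$$P(\mathbf{x}_A, \mathbf{x}_B \mid \mathbf{x}_S) = \frac{f(\mathbf{x}_{A \cup S})\, g(\mathbf{x}_{B \cup S})}{\sum_{\mathbf{x}'_A, \mathbf{x}'_B} f(\mathbf{x}'_A, \mathbf{x}_S)\, g(\mathbf{x}'_B, \mathbf{x}_S)} = \underbrace{\frac{f(\mathbf{x}_{A \cup S})}{\sum_{\mathbf{x}'_A} f(\mathbf{x}'_A, \mathbf{x}_S)}}_{P(\mathbf{x}_A \mid \mathbf{x}_S)} \cdot \underbrace{\frac{g(\mathbf{x}_{B \cup S})}{\sum_{\mathbf{x}'_B} g(\mathbf{x}'_B, \mathbf{x}_S)}}_{P(\mathbf{x}_B \mid \mathbf{x}_S)}$$

The conditional splits into a factor involving only $\mathbf{x}_A$ and a factor involving only $\mathbf{x}_B$, which is exactly the statement $A \perp\!\!\!\perp B \mid S$. Note that this direction never uses positivity.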
Direction 2: Markov Properties ⟹ Factorization
This is the harder direction. The key insight uses the Möbius inversion formula and logarithms.
Define the interaction potentials: $$\log \psi_C(\mathbf{x}_C) = \sum_{D \subseteq C} (-1)^{|C| - |D|} \log P(\mathbf{x}_D, \mathbf{x}^0_{-D})$$
where $\mathbf{x}^0$ is a fixed reference configuration.
The Markov properties ensure that $\psi_C = 1$ (constant) whenever $C$ is not a clique, leaving only clique potentials in the factorization.
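To make this concrete, here is a minimal numerical sketch of the construction on the 3-node chain $X_0 - X_1 - X_2$ (using the same "prefers agreement" pairwise potentials as the demonstration further below; the helper names are illustrative): the interaction terms of the non-cliques $\{0,2\}$ and $\{0,1,2\}$ vanish, and summing the remaining terms recovers $\log P$ exactly.

```python
# Sketch of the Möbius-inversion construction on a strictly positive chain MRF.
import numpy as np
from itertools import product, chain, combinations

psi_01 = np.array([[2.0, 1.0], [1.0, 2.0]])
psi_12 = np.array([[2.0, 1.0], [1.0, 2.0]])

# Strictly positive joint P(x0, x1, x2) that factorizes over the chain's cliques
joint = np.array([[[psi_01[a, b] * psi_12[b, c] for c in range(2)]
                   for b in range(2)] for a in range(2)])
joint /= joint.sum()

x_ref = (0, 0, 0)  # fixed reference configuration x^0

def powerset(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def log_p_with_reference(x, subset):
    """log P(x_D, x^0_{-D}): variables in `subset` take values from x, others take x^0."""
    cfg = tuple(x[i] if i in subset else x_ref[i] for i in range(3))
    return np.log(joint[cfg])

def phi(subset, x):
    """Interaction term log psi_C(x_C) via Möbius inversion over subsets D of C."""
    return sum((-1) ** (len(subset) - len(D)) * log_p_with_reference(x, set(D))
               for D in powerset(subset))

# The Markov property X0 ⊥⊥ X2 | X1 forces the interaction terms of the
# non-cliques {0,2} and {0,1,2} to vanish, leaving only clique potentials.
for x in product(range(2), repeat=3):
    assert abs(phi({0, 2}, x)) < 1e-10
    assert abs(phi({0, 1, 2}, x)) < 1e-10
    # Summing phi_C over all subsets C recovers log P(x) exactly.
    total = sum(phi(set(C), x) for C in powerset(range(3)))
    assert abs(total - np.log(joint[x])) < 1e-10

print("Non-clique interaction terms vanish; clique terms reconstruct log P.")
```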
```python
import numpy as np
from itertools import product, combinations
from typing import Dict, List, Set, Tuple


def verify_factorization_implies_independence(
    potentials: List[Tuple[Set[int], np.ndarray]],
    variable_domains: Dict[int, List],
    set_a: Set[int],
    set_b: Set[int],
    separator: Set[int]
) -> bool:
    """
    Verify that if P factorizes over cliques, separation implies independence.

    Args:
        potentials: List of (scope, potential_table) pairs
        variable_domains: Domain of each variable
        set_a, set_b: Sets to check independence between
        separator: Proposed separator
    """
    variables = sorted(variable_domains.keys())
    domains = [variable_domains[v] for v in variables]

    def compute_joint(assignment: Dict[int, int]) -> float:
        result = 1.0
        for scope, table in potentials:
            idx = tuple(assignment[v] for v in sorted(scope))
            result *= table[idx]
        return result

    # Compute full joint and marginals
    all_configs = list(product(*domains))
    joint = {}
    for vals in all_configs:
        assignment = dict(zip(variables, vals))
        joint[vals] = compute_joint(assignment)
    Z = sum(joint.values())
    joint = {k: v / Z for k, v in joint.items()}

    # Check: P(A, B | S) = P(A | S) * P(B | S) for all A, B, S values
    tolerance = 1e-10
    independence_holds = True
    # This would require marginalizing and checking - simplified here
    print("Verification: Checking if separation implies conditional independence")
    print(f"  Sets: A={set_a}, B={set_b}, S={separator}")
    return independence_holds  # Full verification would be more complex


def construct_clique_potentials_from_distribution(
    joint_probs: Dict[Tuple, float],
    cliques: List[Set[int]],
    reference_config: Tuple
) -> Dict[frozenset, Dict[Tuple, float]]:
    """
    Construct clique potentials from a joint distribution using the
    Möbius inversion approach from the Hammersley-Clifford proof.
    """
    n = len(reference_config)
    potentials = {}
    for clique in cliques:
        clique = frozenset(clique)
        potential = {}
        # For each configuration of the clique
        # (Simplified - full implementation needs Möbius inversion)
        potentials[clique] = potential
    return potentials


# Demonstrate the theorem with a simple example
print("Hammersley-Clifford Theorem Demonstration:")
print("=" * 50)

# A simple 3-node chain: X0 - X1 - X2
# Cliques: {0,1} and {1,2}
# Independence: X0 ⊥⊥ X2 | X1

# Define potentials
psi_01 = np.array([[2, 1], [1, 2]])  # Prefers agreement
psi_12 = np.array([[2, 1], [1, 2]])  # Prefers agreement


def compute_chain_distribution():
    """Compute distribution for 3-node chain and verify independence."""
    joint = np.zeros((2, 2, 2))
    for x0 in range(2):
        for x1 in range(2):
            for x2 in range(2):
                joint[x0, x1, x2] = psi_01[x0, x1] * psi_12[x1, x2]
    Z = joint.sum()
    joint /= Z

    # Check X0 ⊥⊥ X2 | X1:
    # P(X0, X2 | X1) should equal P(X0 | X1) * P(X2 | X1)
    print("\nJoint distribution P(X0, X1, X2):")
    for x0 in range(2):
        for x1 in range(2):
            for x2 in range(2):
                print(f"  P({x0},{x1},{x2}) = {joint[x0, x1, x2]:.4f}")

    print("\nVerifying X0 ⊥⊥ X2 | X1:")
    for x1 in range(2):
        # P(X1 = x1)
        p_x1 = joint[:, x1, :].sum()
        if p_x1 == 0:
            continue
        for x0 in range(2):
            for x2 in range(2):
                p_x0_x2_given_x1 = joint[x0, x1, x2] / p_x1
                p_x0_given_x1 = joint[x0, x1, :].sum() / p_x1
                p_x2_given_x1 = joint[:, x1, x2].sum() / p_x1
                prod = p_x0_given_x1 * p_x2_given_x1
                match = "✓" if abs(p_x0_x2_given_x1 - prod) < 1e-10 else "✗"
                print(f"  X1={x1}: P(X0={x0},X2={x2}|X1) = {p_x0_x2_given_x1:.4f}, "
                      f"P(X0|X1)P(X2|X1) = {prod:.4f} {match}")


compute_chain_distribution()
```

The positivity requirement ($P(\mathbf{x}) > 0$ for all $\mathbf{x}$) is not merely technical; it is essential for the theorem to hold.
Without strict positivity, a distribution can satisfy the Markov properties with respect to a graph without admitting a clique factorization over that graph. It is the Markov ⟹ factorization direction that breaks down; the converse direction (factorization ⟹ Markov properties) does not require positivity.
Classic Counterexample:
Consider three binary variables that are pairwise independent but deterministically coupled, for example the parity distribution that places probability $1/4$ on each configuration with $x_0 \oplus x_1 \oplus x_2 = 0$ and zero probability elsewhere.
This distribution satisfies $X_0 \perp\!\!\!\perp X_1$, $X_1 \perp\!\!\!\perp X_2$, and $X_0 \perp\!\!\!\perp X_2$ (each pair is marginally independent), suggesting an empty graph (no edges).
But it cannot factorize as $P(x_0)P(x_1)P(x_2)$, because the variables are deterministically related: any two of them determine the third, so half of the configurations have probability zero, whereas a product of the (uniform) marginals would give every configuration probability $1/8$.
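This failure is easy to verify numerically. The short check below is a sketch that uses the parity distribution described above; it confirms that every pair is marginally independent while the joint is not the product of its single-variable marginals.

```python
# Sketch: the parity (XOR) counterexample - pairwise independent, yet not factorizable
# as a product of single-variable marginals.
import numpy as np
from itertools import product

# Uniform over the four configurations with x0 XOR x1 XOR x2 == 0
P = np.zeros((2, 2, 2))
for x0, x1, x2 in product(range(2), repeat=3):
    if x0 ^ x1 ^ x2 == 0:
        P[x0, x1, x2] = 0.25

# Every pair is marginally independent ...
for axes, name in [((2,), "X0,X1"), ((0,), "X1,X2"), ((1,), "X0,X2")]:
    pair = P.sum(axis=axes)
    m_first, m_second = pair.sum(axis=1), pair.sum(axis=0)
    print(name, "independent:", np.allclose(pair, np.outer(m_first, m_second)))

# ... but the joint does not factorize as P(x0) P(x1) P(x2)
m0, m1, m2 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))
product_of_marginals = np.einsum("i,j,k->ijk", m0, m1, m2)
print("Joint equals product of marginals:", np.allclose(P, product_of_marginals))
```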
Practical Implications:
In practice, positivity is usually guaranteed by construction: defining potentials in exponential form, $\psi_C(\mathbf{x}_C) = \exp(-E_C(\mathbf{x}_C))$, gives every configuration strictly positive probability, so the theorem applies. Models with hard (zero-probability) constraints fall outside the theorem's scope and must be analyzed separately.
The theorem states we can factorize over maximal cliques. But can we use non-maximal cliques too?
Answer: Yes, with care.
Any clique factorization can be converted to a maximal-clique factorization by "absorbing" each smaller clique potential into a maximal clique that contains it:
$$\prod_{C} \psi_C(\mathbf{x}_C) = \prod_{C_{\max}} \left[\prod_{C \subseteq C_{\max}} \psi_C(\mathbf{x}_C)\right]$$
where, to avoid counting a potential twice, each non-maximal clique $C$ is assigned to exactly one maximal clique $C_{\max}$ containing it.
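For a concrete check, the sketch below builds a triangle graph whose only maximal clique is $\{0, 1, 2\}$ and absorbs three pairwise potentials into a single table over that clique; the potential values are illustrative, not taken from this lesson. Both factorizations assign the same unnormalized score to every configuration.

```python
# Sketch: absorbing pairwise potentials into the single maximal clique {0,1,2}
# of a triangle graph. The potential tables below are illustrative assumptions.
import numpy as np
from itertools import product

psi_01 = np.array([[2.0, 1.0], [1.0, 2.0]])   # potential on clique {0,1}
psi_12 = np.array([[3.0, 1.0], [1.0, 3.0]])   # potential on clique {1,2}
psi_02 = np.array([[1.0, 2.0], [2.0, 1.0]])   # potential on clique {0,2}

# Absorb each pairwise potential (exactly once) into one table over {0,1,2}
psi_012 = psi_01[:, :, None] * psi_12[None, :, :] * psi_02[:, None, :]

# The two factorizations give the same unnormalized score for every configuration
for x0, x1, x2 in product(range(2), repeat=3):
    pairwise = psi_01[x0, x1] * psi_12[x1, x2] * psi_02[x0, x2]
    assert np.isclose(psi_012[x0, x1, x2], pairwise)
print("Pairwise and maximal-clique factorizations agree on all configurations.")
```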
However, working with non-maximal cliques has advantages: potential tables grow exponentially with clique size, so smaller scopes mean fewer parameters to specify or learn, and the resulting potentials are easier to interpret.
Common Practice:
In applications like image processing, we typically use unary potentials on individual variables (tying each pixel to its observed value) and pairwise potentials on neighboring pixels (encouraging smoothness).
Even though the graph might have larger cliques, these simpler potentials suffice for the desired model.
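As a minimal illustration (a sketch, not code from this lesson: the array `noisy` and the weights `theta` and `beta` are assumed for the example), a binary denoising model might combine a unary term that ties each pixel to its observation with pairwise terms that reward agreement between 4-connected neighbors:

```python
# Sketch: unary + pairwise potentials for binary image denoising on a 4-connected grid.
import numpy as np

rng = np.random.default_rng(0)
noisy = rng.integers(0, 2, size=(4, 4))  # observed noisy binary image (assumed data)

theta, beta = 1.0, 0.7                   # unary strength, pairwise smoothness (assumed)

def energy(labels: np.ndarray) -> float:
    """E(x) such that P(x) ∝ exp(-E(x)); lower energy for agreeing with the data
    and for neighboring pixels taking the same label."""
    unary = theta * np.sum(labels == noisy)                   # agree with observation
    horiz = beta * np.sum(labels[:, :-1] == labels[:, 1:])    # horizontal neighbors
    vert = beta * np.sum(labels[:-1, :] == labels[1:, :])     # vertical neighbors
    return -(unary + horiz + vert)

print("Energy of the noisy image itself:", energy(noisy))
```

Because every configuration receives a finite energy, $P(\mathbf{x}) \propto \exp(-E(\mathbf{x}))$ is strictly positive, so the Hammersley-Clifford equivalence applies to such a model.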
The Hammersley-Clifford theorem has profound practical implications:
When building an MRF, think carefully about which edges to include. Each edge creates a direct dependency; each missing edge asserts conditional independence. The Hammersley-Clifford theorem guarantees that your factorization respects exactly these assertions.
What's Next:
The final page explores Energy-Based Models—a powerful perspective that connects MRFs to neural networks, contrastive learning, and modern deep generative models.
You now understand the Hammersley-Clifford theorem—the theoretical bedrock of Markov Random Fields that precisely connects graph structure to probability factorization.