If the DAG structure is the skeleton of a Bayesian Network, then Conditional Probability Tables (CPTs) are its flesh. Each CPT encodes the quantitative relationship between a variable and its parents—specifying exactly how parent configurations influence the probability distribution over the child variable.
Understanding CPTs is essential for both theoretical analysis and practical implementation. These tables are where domain knowledge becomes numerical, where expert estimates meet maximum likelihood estimates, and where the computational rubber meets the probabilistic road.
In this page, we dissect CPT representation with engineering precision: the mathematical requirements, the storage formats, the parameterization choices, and the practical considerations that arise in real-world Bayesian Network implementations. By the end, you will be able to design, implement, and optimize CPT representations for any Bayesian Network application.
By completing this page, you will: (1) Understand the mathematical structure and constraints of CPTs, (2) Master multiple representation formats (table, tree, rule-based), (3) Implement CPTs in code with proper normalization, (4) Analyze storage requirements and optimization strategies, and (5) Handle edge cases like deterministic nodes and noisy-OR.
A Conditional Probability Table specifies the conditional distribution of a variable given its parents. Let us establish the formal requirements.
Definition: Let $X$ be a random variable with possible values $\{x_1, x_2, \ldots, x_r\}$ (where $r$ is the cardinality of $X$), and let $Pa(X) = \{U_1, U_2, \ldots, U_k\}$ be the parents of $X$ with cardinalities $r_1, r_2, \ldots, r_k$ respectively.
The CPT for $X$ specifies: $$P(X = x_i | Pa(X) = \mathbf{u})$$
for all values $x_i$ of $X$ and all configurations $\mathbf{u} = (u_1, u_2, \ldots, u_k)$ of the parents.
Constraints:
Non-negativity: $P(X = x_i | Pa(X) = \mathbf{u}) \geq 0$ for all $i, \mathbf{u}$
Normalization: For each parent configuration $\mathbf{u}$: $$\sum_{i=1}^{r} P(X = x_i | Pa(X) = \mathbf{u}) = 1$$
These constraints ensure each row of the CPT is a valid probability distribution.
There are two common conventions for CPT layout. In the 'row normalization' convention (used here), each row corresponds to a parent configuration and sums to 1. In the 'column normalization' convention, each column corresponds to a parent configuration and it is the columns that sum to 1. Always verify which convention a particular software package uses!
Degrees of Freedom:
The total number of free parameters in a CPT for variable $X$ with parents $Pa(X)$:
$$\text{Parameters} = (r - 1) \times \prod_{j=1}^{k} r_j$$
The factor $(r - 1)$ accounts for normalization: given $r - 1$ probabilities, the $r$-th is determined.
Example Calculation:
Suppose $X$ is ternary (3 values) with two binary parents $U_1, U_2$:
There are $2 \times 2 = 4$ parent configurations, so the full CPT has $4 \times 3 = 12$ entries, of which only $(3 - 1) \times 4 = 8$ are free.
Special Case: Root Nodes
Variables with no parents ($Pa(X) = \emptyset$) have CPTs that are simply marginal distributions: $$P(X = x_i)$$
Such a table has $r - 1$ free parameters. These are sometimes called Prior Probability Tables.
| Child States | Parents | Entries | Free Parameters |
|---|---|---|---|
| 2 (binary) | None (root) | 2 | 1 |
| 2 (binary) | 1 binary parent | 4 | 2 |
| 2 (binary) | 2 binary parents | 8 | 4 |
| 2 (binary) | 3 binary parents | 16 | 8 |
| 3 (ternary) | 2 binary parents | 12 | 8 |
| 5 (5-ary) | 3 ternary parents | 135 | 108 |
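As a quick check, here is a minimal Python sketch (the helper name `cpt_sizes` is just for illustration) that applies the formula above and reproduces two rows of this table:

```python
from math import prod

def cpt_sizes(child_card, parent_cards):
    """Total entries and free parameters of a CPT, per the formula above."""
    entries = child_card * prod(parent_cards)      # prod([]) == 1 handles root nodes
    free = (child_card - 1) * prod(parent_cards)
    return entries, free

print(cpt_sizes(3, [2, 2]))     # (12, 8)   -- ternary child, two binary parents
print(cpt_sizes(5, [3, 3, 3]))  # (135, 108) -- 5-ary child, three ternary parents
```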
The most straightforward CPT representation is a multi-dimensional table. Each dimension corresponds to a parent variable, and the entries store the conditional probabilities for each value of the child.
Standard Table Format:
For a binary variable $W$ (Wet Grass) with binary parents $R$ (Rain) and $S$ (Sprinkler), the CPT can be written as:
| R | S | P(W=wet) | P(W=dry) |
|---|---|---|---|
| rain | on | 0.99 | 0.01 |
| rain | off | 0.90 | 0.10 |
| no_rain | on | 0.90 | 0.10 |
| no_rain | off | 0.00 | 1.00 |
Each row sums to 1.0, ensuring normalization. Note that we could store only P(W=wet) since P(W=dry) = 1 - P(W=wet).
Multi-Dimensional Array Representation:
Computationally, CPTs are typically stored as multi-dimensional arrays. For $W$ with parents $R$ and $S$:
CPT[R][S][W] = probability
Or equivalently, with the child dimension last:
CPT[R, S, :] = [P(W=wet|R,S), P(W=dry|R,S)]
```python
import numpy as np


class CPT:
    """
    Conditional Probability Table representation.

    Supports:
    - Multi-dimensional array storage
    - Automatic normalization verification
    - Probability queries
    - Pretty printing
    """

    def __init__(self, variable_name, variable_states, parent_names=None, parent_states=None):
        """
        Initialize a CPT.

        Args:
            variable_name: Name of the child variable
            variable_states: List of possible states for the child
            parent_names: List of parent variable names (None for root)
            parent_states: Dict mapping parent name -> list of states
        """
        self.variable_name = variable_name
        self.variable_states = variable_states
        self.parent_names = parent_names or []
        self.parent_states = parent_states or {}

        # Compute shape of the probability table
        # Dimensions: [parent_1, parent_2, ..., parent_k, child]
        self.shape = []
        for parent in self.parent_names:
            self.shape.append(len(self.parent_states[parent]))
        self.shape.append(len(variable_states))

        # Initialize with uniform distribution
        self.table = np.ones(self.shape) / len(variable_states)

        # Create index mappings
        self._build_index_maps()

    def _build_index_maps(self):
        """Build mappings from state names to indices."""
        self.child_to_idx = {s: i for i, s in enumerate(self.variable_states)}
        self.parent_to_idx = {}
        for parent in self.parent_names:
            self.parent_to_idx[parent] = {
                s: i for i, s in enumerate(self.parent_states[parent])
            }

    def set_probabilities(self, parent_config, probabilities):
        """
        Set the probability distribution for a given parent configuration.

        Args:
            parent_config: Dict mapping parent name -> state, or empty for root
            probabilities: Dict mapping child state -> probability
        """
        # Build index tuple for parents
        idx = []
        for parent in self.parent_names:
            state = parent_config[parent]
            idx.append(self.parent_to_idx[parent][state])
        idx = tuple(idx) if idx else ()

        # Set probabilities
        probs = [probabilities.get(s, 0.0) for s in self.variable_states]

        # Verify normalization
        total = sum(probs)
        if not np.isclose(total, 1.0):
            raise ValueError(f"Probabilities must sum to 1.0, got {total}")

        if idx:
            self.table[idx] = probs
        else:
            self.table[:] = probs

    def get_probability(self, child_state, parent_config=None):
        """
        Get P(child=child_state | parents=parent_config).

        Args:
            child_state: The state of the child variable
            parent_config: Dict mapping parent name -> state

        Returns:
            The conditional probability
        """
        parent_config = parent_config or {}
        idx = []
        for parent in self.parent_names:
            state = parent_config[parent]
            idx.append(self.parent_to_idx[parent][state])
        idx.append(self.child_to_idx[child_state])
        return self.table[tuple(idx)]

    def verify_normalization(self):
        """Verify all rows sum to 1."""
        if len(self.parent_names) == 0:
            # Root node - just check the single distribution
            return np.isclose(self.table.sum(), 1.0)
        # Sum over child dimension (last axis)
        row_sums = self.table.sum(axis=-1)
        return np.allclose(row_sums, 1.0)

    def __repr__(self):
        lines = [f"CPT for P({self.variable_name} | {', '.join(self.parent_names) or '∅'})"]
        lines.append(f"  States: {self.variable_states}")
        lines.append(f"  Shape: {self.shape}")
        lines.append(f"  Parameters: {np.prod(self.shape[:-1]) * (self.shape[-1] - 1)}")
        lines.append(f"  Normalized: {self.verify_normalization()}")
        return "\n".join(lines)


# Example: Create CPT for WetGrass
print("=== CPT Representation Example ===\n")

cpt_w = CPT(
    variable_name="WetGrass",
    variable_states=["wet", "dry"],
    parent_names=["Rain", "Sprinkler"],
    parent_states={
        "Rain": ["rain", "no_rain"],
        "Sprinkler": ["on", "off"]
    }
)

# Set probabilities for each parent configuration
cpt_w.set_probabilities(
    {"Rain": "rain", "Sprinkler": "on"},
    {"wet": 0.99, "dry": 0.01}
)
cpt_w.set_probabilities(
    {"Rain": "rain", "Sprinkler": "off"},
    {"wet": 0.90, "dry": 0.10}
)
cpt_w.set_probabilities(
    {"Rain": "no_rain", "Sprinkler": "on"},
    {"wet": 0.90, "dry": 0.10}
)
cpt_w.set_probabilities(
    {"Rain": "no_rain", "Sprinkler": "off"},
    {"wet": 0.00, "dry": 1.00}
)

print(cpt_w)
print("\nInternal table shape:", cpt_w.table.shape)
print("Internal table:")
print(cpt_w.table)

# Query probabilities
print("\n=== Probability Queries ===")
print(f"P(WetGrass=wet | Rain=rain, Sprinkler=on) = "
      f"{cpt_w.get_probability('wet', {'Rain': 'rain', 'Sprinkler': 'on'})}")
print(f"P(WetGrass=dry | Rain=no_rain, Sprinkler=off) = "
      f"{cpt_w.get_probability('dry', {'Rain': 'no_rain', 'Sprinkler': 'off'})}")
```

While tabular representation is the most general, it becomes unwieldy when variables have many parents or many states. Several alternative representations provide more compact encoding for special cases.
1. Decision Tree Representation:
When context-specific independence exists—where the distribution over the child depends on only some parents in certain contexts—a decision tree can be more compact.
Example: Consider $P(X | A, B, C)$ where, when $A = 0$, the distribution of $X$ depends only on $B$, and when $A = 1$, it depends only on $C$.
A table would need $2^3 = 8$ rows. A decision tree needs only 4 leaves:
```
            A
          /   \
         0     1
         |     |
         B     C
        / \   / \
     P(X) P(X) P(X) P(X)
```
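The tree above can be encoded directly, for example as nested dictionaries. The sketch below is illustrative; the leaf probabilities are placeholder values, not part of the example above:

```python
# Tree-structured CPD for P(X | A, B, C) with the context-specific
# independence described above: when A=0, X depends only on B;
# when A=1, X depends only on C. Four leaves instead of eight rows.
tree_cpd = {
    "split_on": "A",
    0: {"split_on": "B", 0: {"x0": 0.9, "x1": 0.1}, 1: {"x0": 0.4, "x1": 0.6}},
    1: {"split_on": "C", 0: {"x0": 0.7, "x1": 0.3}, 1: {"x0": 0.2, "x1": 0.8}},
}

def tree_lookup(node, assignment):
    """Walk the tree until a leaf distribution over X is reached."""
    while "split_on" in node:
        node = node[assignment[node["split_on"]]]
    return node

print(tree_lookup(tree_cpd, {"A": 0, "B": 1, "C": 0}))  # C is irrelevant when A=0
```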
2. Rule-Based Representation:
For variables with many zero-probability configurations, rules can be more efficient:
IF Rain=yes AND Sprinkler=on THEN P(Wet)=0.99
IF Rain=yes AND Sprinkler=off THEN P(Wet)=0.90
...
DEFAULT P(Wet)=0.0 # Implicit for unspecified cases
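A minimal sketch of how such a rule list might be evaluated: the first rule whose conditions match supplies P(Wet), and unspecified cases fall through to the default (the conditions and numbers mirror the fragment above):

```python
# Rule-based CPD lookup: ordered (conditions, probability) pairs plus a default.
rules = [
    ({"Rain": "yes", "Sprinkler": "on"},  0.99),
    ({"Rain": "yes", "Sprinkler": "off"}, 0.90),
]
DEFAULT_P_WET = 0.0

def p_wet(assignment):
    for conditions, prob in rules:
        if all(assignment.get(var) == val for var, val in conditions.items()):
            return prob
    return DEFAULT_P_WET          # implicit for unspecified cases

print(p_wet({"Rain": "yes", "Sprinkler": "on"}))   # 0.99
print(p_wet({"Rain": "no",  "Sprinkler": "off"}))  # falls through to 0.0
```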
Context-specific independence (CSI) occurs when conditional independence holds for specific values of conditioning variables but not globally. For example, X ⊥ Y | Z=1 might hold, but X ⊥̸ Y | Z=0. Tree representations naturally capture CSI, while tables cannot without redundancy.
3. Noisy-OR and Noisy-AND Models:
For binary variables with many causes, the Noisy-OR model provides a compact parameterization. Instead of specifying $2^k$ parameters for $k$ binary parents, Noisy-OR uses only $k + 1$ parameters.
Noisy-OR Model:
Assume each parent $U_i$ independently has a probability $q_i$ of being 'inhibited' (failing to activate the effect even when present). Additionally, there's a 'leak' probability $q_0$ of the effect being absent even with no causes.
$$P(X = 0 | U_1, \ldots, U_k) = q_0 \prod_{i: U_i = 1} q_i$$ $$P(X = 1 | U_1, \ldots, U_k) = 1 - q_0 \prod_{i: U_i = 1} q_i$$
Intuition: The effect $X$ is absent only if ALL active causes are inhibited AND the leak doesn't trigger.
Example: Lung disease ($X$) with causes: Smoking ($U_1$), Pollution ($U_2$), Genetics ($U_3$)
With 3 binary parents: Noisy-OR uses 4 parameters vs. 8 for full table.
```python
import numpy as np
from itertools import product


class NoisyOR:
    """
    Noisy-OR model for binary variables with multiple binary causes.

    Parameters:
    - leak_prob: P(X=0 | no causes active) = probability of no effect with no causes
    - inhibition_probs: Dict mapping parent name -> P(inhibited | cause active)
    """

    def __init__(self, variable_name, parent_names, leak_prob, inhibition_probs):
        self.variable_name = variable_name
        self.parent_names = parent_names
        self.leak_prob = leak_prob                # q_0
        self.inhibition_probs = inhibition_probs  # {parent: q_i}

    def get_probability(self, child_value, parent_config):
        """
        Compute P(X | parents) using the Noisy-OR formula.

        Args:
            child_value: 0 or 1
            parent_config: Dict mapping parent name -> 0 or 1

        Returns:
            P(X=child_value | parents)
        """
        # Start with leak probability
        prob_no_effect = self.leak_prob

        # Multiply by inhibition probability for each active cause
        for parent, value in parent_config.items():
            if value == 1:  # Cause is active
                prob_no_effect *= self.inhibition_probs[parent]

        # P(X=0) = product of inhibitions; P(X=1) = 1 - P(X=0)
        if child_value == 0:
            return prob_no_effect
        else:
            return 1 - prob_no_effect

    def generate_full_cpt(self):
        """
        Generate the equivalent full CPT for comparison.

        Returns:
            Dict of parent_config_tuple -> [P(X=0), P(X=1)]
        """
        cpt = {}
        parent_values = list(product([0, 1], repeat=len(self.parent_names)))
        for values in parent_values:
            config = dict(zip(self.parent_names, values))
            p0 = self.get_probability(0, config)
            p1 = self.get_probability(1, config)
            cpt[values] = [p0, p1]
        return cpt

    def parameter_savings(self):
        """Calculate parameter savings vs full table."""
        k = len(self.parent_names)
        full_table_params = 2 ** k   # One free param per row (binary child)
        noisy_or_params = k + 1      # k inhibition probs + 1 leak
        return full_table_params, noisy_or_params


# Example: Lung Disease model
print("=== Noisy-OR Model Example ===\n")

noisy_or = NoisyOR(
    variable_name="LungDisease",
    parent_names=["Smoking", "Pollution", "Genetics"],
    leak_prob=0.95,  # 95% no disease without causes
    inhibition_probs={
        "Smoking": 0.10,    # Smoking very likely to cause
        "Pollution": 0.70,  # Pollution moderate cause
        "Genetics": 0.40    # Genetics significant cause
    }
)

# Generate full CPT
print("Generated CPT from Noisy-OR model:")
print("-" * 60)
print(f"{'Smoking':>8} {'Pollution':>10} {'Genetics':>10} | P(Disease=0)  P(Disease=1)")
print("-" * 60)

cpt = noisy_or.generate_full_cpt()
for config, probs in sorted(cpt.items()):
    print(f"{config[0]:>8} {config[1]:>10} {config[2]:>10} |   {probs[0]:.4f}        {probs[1]:.4f}")

# Parameter savings
full, compact = noisy_or.parameter_savings()
print(f"\nParameter comparison:")
print(f"  Full CPT:  {full} parameters")
print(f"  Noisy-OR:  {compact} parameters")
print(f"  Reduction: {(1 - compact/full)*100:.1f}%")

# For more parents, savings are dramatic
print("\nScaling with number of parents:")
for k in [3, 5, 10, 20]:
    full = 2 ** k
    compact = k + 1
    print(f"  {k} parents: Full={full:,} vs Noisy-OR={compact} ({(1-compact/full)*100:.2f}% reduction)")
```

Traditional CPTs assume discrete variables with finite states. For continuous variables or mixed discrete-continuous networks, different representations are needed.
Conditional Linear Gaussian (CLG) Models:
For continuous variables with Gaussian distributions, the CPT is replaced by parameters of a conditional Gaussian:
$$P(X | Pa(X)) = \mathcal{N}(\mu(Pa(X)), \sigma^2(Pa(X)))$$
where the mean $\mu$ is a linear function of continuous parents: $$\mu = \beta_0 + \sum_{i} \beta_i U_i$$
For each discrete parent configuration, different $\beta$ coefficients and variances may apply.
Example: Height $(H)$ conditioned on Age $(A)$ and Gender $(G)$: $$P(H | A, G=\text{male}) = \mathcal{N}(100 + 3 \cdot A, 5^2)$$ $$P(H | A, G=\text{female}) = \mathcal{N}(95 + 2.8 \cdot A, 4.5^2)$$
Here, each gender has its own linear model.
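A small sketch of this CLG example, using SciPy's normal density and the coefficients from the equations above (the helper name `height_density` and the dictionary layout are just for illustration):

```python
from scipy.stats import norm

# One linear-Gaussian model per discrete parent configuration (Gender).
clg_params = {
    "male":   {"beta0": 100.0, "beta_age": 3.0, "sigma": 5.0},
    "female": {"beta0":  95.0, "beta_age": 2.8, "sigma": 4.5},
}

def height_density(h, age, gender):
    """Evaluate p(H = h | Age = age, Gender = gender)."""
    p = clg_params[gender]
    mu = p["beta0"] + p["beta_age"] * age          # linear mean in the continuous parent
    return norm.pdf(h, loc=mu, scale=p["sigma"])

print(height_density(130, age=10, gender="male"))    # mean is 130 for age 10
print(height_density(130, age=10, gender="female"))  # mean is 123 for age 10
```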
In CLG networks, a continuous node cannot be a parent of a discrete node (the integral over continuous space to compute discrete probabilities doesn't factor nicely). This constraint limits network topologies when mixing discrete and continuous variables. Some extensions use discretization or special cases to relax this.
Discretization:
A common practical approach is to discretize continuous variables into bins, for example equal-width intervals, equal-frequency (quantile) bins, or domain-motivated cut points. This enables standard CPT representation but introduces information loss, sensitivity to the choice of bin boundaries, and a growth in parameters as the number of bins increases.
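As an illustration, NumPy's `digitize` maps continuous readings to bin indices that can then serve as discrete CPT states; the bin edges and labels below are assumptions made for the example:

```python
import numpy as np

# Discretize temperature (°C) into five named bins for use as CPT states.
edges = np.array([0.0, 15.0, 25.0, 35.0])   # 4 edges -> 5 bins (two open-ended)
labels = ["freezing", "cold", "mild", "warm", "hot"]

temps = np.array([-3.2, 12.0, 22.5, 31.0, 40.1])
bin_idx = np.digitize(temps, edges)          # bin index for each reading
print([labels[i] for i in bin_idx])          # ['freezing', 'cold', 'mild', 'warm', 'hot']
```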
Kernel Density Estimation:
For fully continuous parents, non-parametric approaches model the conditional density: $$P(X | Pa(X)) \approx \sum_{j=1}^{N} w_j \cdot K_h(X - X^{(j)}, Pa(X) - Pa(X)^{(j)})$$
where $K_h$ is a kernel function (e.g., Gaussian) and the sum is over training points with similar parent values.
Neural Network CPDs:
Modern approaches parameterize conditional distributions using neural networks: $$P(X | Pa(X)) = f_\theta(Pa(X))$$
where $f_\theta$ outputs distribution parameters (e.g., mixture weights, means, variances). This allows flexible, learnable conditional distributions but sacrifices interpretability and closed-form inference.
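A toy sketch of the idea: a one-hidden-layer network whose softmax output is a distribution over three child states. The weights here are random placeholders; in practice $\theta$ would be learned from data, and the encoding of the parents is an assumption for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

W1 = rng.normal(size=(4, 8))        # 4 encoded parent features -> 8 hidden units
W2 = rng.normal(size=(8, 3))        # 8 hidden units -> 3 child states

def neural_cpd(parent_features):
    """f_theta(Pa(X)): map encoded parents to a distribution over X."""
    h = np.tanh(parent_features @ W1)
    return softmax(h @ W2)

parents = np.array([1.0, 0.0, 0.0, 1.0])   # example one-hot-style encoding
print(neural_cpd(parents))                  # sums to 1 by construction
```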
| Variable Types | Representation | Pros | Cons |
|---|---|---|---|
| All discrete | Standard CPT | Exact, interpretable | Exponential in parents |
| Discrete + Continuous (CLG) | Conditional Gaussian | Closed-form inference | Topology constraints |
| Discretized continuous | Standard CPT on bins | Simple implementation | Information loss |
| General continuous | Kernel density / Neural | Flexible | Computational cost |
Efficient storage and retrieval of CPT values is crucial for large-scale Bayesian Networks. Several optimization strategies reduce memory footprint and improve computational performance.
Sparse Representation:
When many CPT entries are zero or near-zero (common in practice), sparse storage is efficient: store only the nonzero entries, for example as a dictionary keyed by index tuples or in a compressed sparse format, and treat missing entries as zero (or as a shared default value).
Log-Probability Storage:
To avoid numerical underflow when multiplying many probabilities:
$$\log P(X_1, \ldots, X_n) = \sum_{i=1}^{n} \log P(X_i | Pa(X_i))$$
When summing probabilities in log-space (for marginalization), use the log-sum-exp identity: $$\log \sum_i \exp(\ell_i) = \max_i \ell_i + \log \sum_i \exp\left(\ell_i - \max_i \ell_i\right)$$ Subtracting the maximum before exponentiating prevents underflow and overflow, and is essential for numerically stable implementations.
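A minimal implementation of this identity, contrasted with the naive computation that underflows (the example log-probabilities are chosen to force the failure):

```python
import numpy as np

def log_sum_exp(log_ps):
    """Numerically stable log(sum(exp(log_ps))), as described above."""
    m = np.max(log_ps)
    return m + np.log(np.sum(np.exp(log_ps - m)))

log_ps = np.array([-1000.0, -1001.0, -1002.0])   # logs of extremely small probabilities
print(log_sum_exp(log_ps))                       # ≈ -999.59, computed stably
print(np.log(np.sum(np.exp(log_ps))))            # naive version underflows to -inf
```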
Index Computation:
For multi-dimensional arrays, converting parent configuration to array index efficiently:
```python
index = 0
stride = 1
for parent in reversed(parents):           # last parent varies fastest (row-major order)
    index += value_index[parent] * stride
    stride *= cardinality[parent]
```
This computes the linear index from multi-dimensional coordinates in O(k) for k parents.
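A quick sanity check of this stride scheme against NumPy's own row-major index computation (`np.ravel_multi_index`), using assumed cardinalities:

```python
import numpy as np

parent_cards = [2, 3, 4]     # cardinalities of three parents
parent_values = [1, 2, 3]    # one particular parent configuration

index, stride = 0, 1
for value, card in zip(reversed(parent_values), reversed(parent_cards)):
    index += value * stride
    stride *= card

print(index)                                                   # 23
print(np.ravel_multi_index(parent_values, dims=parent_cards))  # 23 as well
```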
Caching Computed Factors:
During inference (especially variable elimination), intermediate factors are computed repeatedly. Common caching strategies: memoize factors keyed by their variable scope and the conditioning evidence, reuse them across related queries, and invalidate cached entries whenever evidence or CPT parameters change.
Lazy Evaluation:
For very large CPTs, compute entries on demand rather than pre-computing the full table, caching each computed value for later reuse.
```python
import numpy as np
from scipy.special import logsumexp


class OptimizedCPT:
    """
    CPT with numerical optimizations for stable inference.
    """

    def __init__(self, variable_states, parent_cardinalities):
        """
        Initialize optimized CPT.

        Args:
            variable_states: Number of states for child variable
            parent_cardinalities: List of cardinalities for each parent
        """
        self.variable_states = variable_states
        self.parent_cardinalities = parent_cardinalities

        # Compute shape and total size
        self.shape = tuple(parent_cardinalities) + (variable_states,)
        self.size = np.prod(self.shape)

        # Store in log-space for numerical stability
        # Initialize with uniform (-log(variable_states))
        self.log_table = np.full(self.shape, -np.log(variable_states))

        # Precompute strides for fast indexing
        self._compute_strides()

    def _compute_strides(self):
        """Precompute strides for multi-dimensional indexing."""
        self.strides = []
        stride = 1
        for card in reversed(self.parent_cardinalities):
            self.strides.append(stride)
            stride *= card
        self.strides = list(reversed(self.strides))

    def parent_config_to_index(self, parent_indices):
        """Convert parent configuration to linear index."""
        return sum(idx * stride for idx, stride in zip(parent_indices, self.strides))

    def set_distribution(self, parent_indices, probabilities):
        """
        Set probability distribution for a parent configuration.
        Stored in log-space.

        Args:
            parent_indices: Tuple of parent state indices
            probabilities: Array of probabilities (must sum to 1)
        """
        probs = np.array(probabilities)
        if not np.isclose(probs.sum(), 1.0):
            raise ValueError("Probabilities must sum to 1")

        # Store as log probabilities
        # Handle zeros by using a very small value
        log_probs = np.where(probs > 0, np.log(probs), -1e10)
        self.log_table[parent_indices] = log_probs

    def get_log_probability(self, parent_indices, child_index):
        """Get log P(child | parents)."""
        return self.log_table[parent_indices + (child_index,)]

    def get_probability(self, parent_indices, child_index):
        """Get P(child | parents) - converts from log-space."""
        return np.exp(self.get_log_probability(parent_indices, child_index))

    def marginalize_child(self, parent_indices):
        """
        Compute log Σ_x P(x | parents) for one parent configuration.
        For a valid CPT this equals 0 (log 1), since each row sums to 1.
        """
        log_probs = self.log_table[parent_indices]
        return logsumexp(log_probs)

    def compute_joint_log_prob(self, assignments, cpts_and_parents):
        """
        Compute log P(x1, ..., xn) using the factorization.

        Args:
            assignments: Dict mapping variable -> state index
            cpts_and_parents: List of (cpt, variable, parents) tuples

        Returns:
            Log probability of the joint assignment
        """
        log_prob = 0.0
        for cpt, var, parents in cpts_and_parents:
            parent_indices = tuple(assignments[p] for p in parents)
            child_index = assignments[var]
            log_prob += cpt.get_log_probability(parent_indices, child_index)
        return log_prob


# Demonstrate numerical stability benefits
print("=== Numerical Stability Demonstration ===\n")

# Create a CPT with some very small probabilities
cpt = OptimizedCPT(variable_states=3, parent_cardinalities=[2, 2])

# Set distributions including some very small probabilities
cpt.set_distribution((0, 0), [0.99, 0.009, 0.001])
cpt.set_distribution((0, 1), [0.001, 0.001, 0.998])
cpt.set_distribution((1, 0), [0.333, 0.334, 0.333])
cpt.set_distribution((1, 1), [0.1, 0.1, 0.8])

print("CPT with small probabilities:")
print("P(X=2 | parents=(0,1)):", cpt.get_probability((0, 1), 2))
print("Log P(X=2 | parents=(0,1)):", cpt.get_log_probability((0, 1), 2))

# Simulate multiplying many probabilities
print("\n=== Stability for Long Chains ===")

# Multiplying 100 probabilities of 0.9 each
n_factors = 100
single_prob = 0.9

# Direct multiplication - still representable here, but shrinking fast
direct_product = single_prob ** n_factors
print(f"Direct: 0.9^{n_factors} = {direct_product}")

# Log-space computation - stable
log_product = n_factors * np.log(single_prob)
recovered = np.exp(log_product)
print(f"Log-space: exp({n_factors} * log(0.9)) = {recovered}")

# For much longer chains the direct product underflows to 0.0 in float64
n_factors = 10000
direct_product = single_prob ** n_factors  # Underflows to 0.0
log_product = n_factors * np.log(single_prob)
print(f"\nDirect: 0.9^{n_factors} = {direct_product} (underflow!)")
print(f"Log-space: {n_factors} * log(0.9) = {log_product:.2f} (stable)")
```

Not all conditional relationships are probabilistic—some are deterministic. A variable $X$ is deterministic given its parents if $P(X = f(Pa(X)) | Pa(X)) = 1$ for some function $f$.
Examples of Deterministic Nodes: logical gates (AND, OR, XOR of binary parents), threshold indicators (e.g., HighTemperature = 1 if Temperature > 30°C), selectors/multiplexers that output whichever parent another parent chooses, and arithmetic functions such as sums or maxima of parent values.
CPT for Deterministic Nodes:
The CPT has a special structure: each row has exactly one entry of 1.0 and all others 0.0.
| A | B | P(AND=0) | P(AND=1) |
|---|---|---|---|
| 0 | 0 | 1.0 | 0.0 |
| 0 | 1 | 1.0 | 0.0 |
| 1 | 0 | 1.0 | 0.0 |
| 1 | 1 | 0.0 | 1.0 |
Storage is wasteful—we need only store the function $f$, not the full table.
Deterministic relationships introduce zeros in the joint distribution, which can cause issues: (1) the log of zero is undefined, requiring special handling, (2) conditioning on evidence rendered 'impossible' by those zeros leads to division by zero during inference, and (3) some convergence proofs require strictly positive distributions. A common workaround is to substitute a tiny ε (and log ε) for true zeros, at the cost of a slight approximation.
Implementing Deterministic Nodes:
Rather than storing a full CPT, store the deterministic function:
```python
class DeterministicNode:
    def __init__(self, function, parent_names):
        self.function = function
        self.parent_names = parent_names

    def get_value(self, parent_config):
        return self.function(**parent_config)

    def get_probability(self, child_value, parent_config):
        computed = self.get_value(parent_config)
        return 1.0 if child_value == computed else 0.0
```
Constraint Satisfaction:
Deterministic nodes can encode constraints: clamping (observing) a deterministic node to a particular value restricts its parents to exactly those configurations consistent with the encoded function, which is how hard logical constraints are expressed inside an otherwise probabilistic model.
Inference with Deterministic Nodes:
Deterministic relationships prune the search space during inference: once a node's parents are instantiated its value is fixed, so configurations with probability zero can be discarded outright, much as in constraint propagation.
```python
class DeterministicNode:
    """
    A node whose value is a deterministic function of its parents.
    More efficient than storing a full CPT of 0s and 1s.
    """

    def __init__(self, name, parent_names, function):
        """
        Args:
            name: Node name
            parent_names: List of parent variable names
            function: Function that takes parent values and returns child value
        """
        self.name = name
        self.parent_names = parent_names
        self.function = function

    def compute_value(self, parent_values):
        """Compute the deterministic output given parent values."""
        if isinstance(parent_values, dict):
            args = [parent_values[p] for p in self.parent_names]
        else:
            args = parent_values
        return self.function(*args)

    def get_probability(self, child_value, parent_values):
        """P(child | parents) - 1 if child equals computed value, else 0."""
        computed = self.compute_value(parent_values)
        return 1.0 if child_value == computed else 0.0

    def get_log_probability(self, child_value, parent_values):
        """Log probability - 0 if match, -inf otherwise."""
        computed = self.compute_value(parent_values)
        return 0.0 if child_value == computed else float('-inf')


# Example: Logical AND gate
and_gate = DeterministicNode(
    name="AND",
    parent_names=["A", "B"],
    function=lambda a, b: 1 if (a == 1 and b == 1) else 0
)

print("=== Deterministic AND Gate ===\n")
for a in [0, 1]:
    for b in [0, 1]:
        result = and_gate.compute_value({'A': a, 'B': b})
        print(f"AND({a}, {b}) = {result}")

# Example: Threshold node
threshold_node = DeterministicNode(
    name="HighTemperature",
    parent_names=["Temperature"],
    function=lambda t: 1 if t > 30 else 0
)

print("\n=== Threshold Node (T > 30°C) ===\n")
for temp in [20, 25, 30, 35, 40]:
    is_high = threshold_node.compute_value({'Temperature': temp})
    prob = threshold_node.get_probability(1, {'Temperature': temp})
    print(f"Temperature={temp}°C: HighTemperature={is_high}, P(High=1)={prob}")

# Example: Selector/Multiplexer
selector = DeterministicNode(
    name="Output",
    parent_names=["Switch", "InputA", "InputB"],
    function=lambda switch, a, b: a if switch == 0 else b
)

print("\n=== Selector/Multiplexer ===\n")
print("Switch=0: selects InputA")
print("Switch=1: selects InputB")
for switch in [0, 1]:
    for a in [10, 20]:
        for b in [100, 200]:
            out = selector.compute_value({'Switch': switch, 'InputA': a, 'InputB': b})
            print(f"  Switch={switch}, A={a}, B={b} → Output={out}")

# Comparison: Memory usage
print("\n=== Memory Comparison ===\n")

class MemoryAnalysis:
    @staticmethod
    def cpt_memory(child_states, parent_cardinalities):
        """Bytes for full CPT (float64)."""
        entries = child_states
        for card in parent_cardinalities:
            entries *= card
        return entries * 8  # 8 bytes per float64

    @staticmethod
    def deterministic_memory():
        """Memory for deterministic node (just a function reference)."""
        return 64  # Rough estimate for function object

# AND gate: 2 binary parents, 1 binary output
print("AND gate (2 binary parents):")
print(f"  Full CPT: {MemoryAnalysis.cpt_memory(2, [2, 2])} bytes")
print(f"  Deterministic: ~{MemoryAnalysis.deterministic_memory()} bytes")

# Larger example: 10 binary inputs
print("\n10-input AND gate:")
print(f"  Full CPT: {MemoryAnalysis.cpt_memory(2, [2]*10):,} bytes ({MemoryAnalysis.cpt_memory(2, [2]*10)/1024:.1f} KB)")
print(f"  Deterministic: ~{MemoryAnalysis.deterministic_memory()} bytes")
```

When data is scarce or unavailable, CPT parameters must be elicited from domain experts. This is both an art and a science, requiring careful methodology to extract accurate probability estimates.
Elicitation Challenges:
Probability estimation is hard: Humans are notoriously poor at estimating probabilities, subject to various cognitive biases
Scale effects: Experts may think in terms of 'likely', 'unlikely', 'rare' rather than 0.7, 0.2, 0.01
Conditioning confusion: Experts may conflate P(A|B) with P(B|A) or fail to hold conditioning variables fixed
Overconfidence: Experts tend to give distributions that are too narrow, underestimating uncertainty
Effort scaling: A CPT with 64 rows requires eliciting 64 (or more) distributions
Elicitation Techniques:
Anchoring: Start with a baseline and ask for adjustments: "Compared to the average case, how much more likely is X given Y?"
Scenarios: Present concrete scenarios rather than abstract probabilities: "Out of 100 patients with symptom A and test result B, how many would have disease X?"
Betting: Frame the question as a bet: "Would you bet on X at 3-to-1 odds given Y?" This elicits probabilities through revealed preferences.
Interval estimation: Ask for ranges rather than points: "What's the lowest/highest probability you'd consider reasonable?"
Calibration training: Train experts using questions with known answers to improve subsequent estimates.
For causes that act roughly independently, elicit just the strength of each individual cause (how likely is the effect with ONLY this cause present?) plus a background/leak probability, then compute the full CPT using Noisy-OR, as the sketch below illustrates. This reduces the elicitation burden from $2^k$ entries to $k + 1$ single-cause questions.
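A small sketch of this workflow under assumed elicited numbers (chosen to roughly match the earlier Noisy-OR example): single-cause answers are converted to inhibition probabilities, then the full CPT is expanded with the Noisy-OR formula from earlier.

```python
from itertools import product

# Elicited quantities (illustrative): P(effect | only this cause) and
# P(effect | no causes present).
p_background = 0.05
p_only = {"Smoking": 0.90, "Pollution": 0.33, "Genetics": 0.62}

q0 = 1.0 - p_background                              # leak: P(no effect | no causes)
q = {c: (1.0 - p) / q0 for c, p in p_only.items()}   # inhibition probability per cause

# Expand the full CPT from just k + 1 elicited numbers.
causes = list(p_only)
for config in product([0, 1], repeat=len(causes)):
    prob_no_effect = q0
    for cause, active in zip(causes, config):
        if active:
            prob_no_effect *= q[cause]
    print(config, round(1.0 - prob_no_effect, 4))    # P(effect | this configuration)
```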
Validation Techniques:
After elicitation, CPTs should be validated before deployment:
Consistency checks: verify that every row sums to 1 and that probabilities move in the expected direction as parent values change (for example, adding risk factors should not decrease the probability of disease).
Sensitivity analysis: perturb elicited values and observe whether the network's conclusions change materially; parameters with high influence deserve the most careful elicitation.
Cross-validation with experts: have a second expert (or the same expert in a later session) independently re-estimate a sample of entries and reconcile discrepancies.
Historical data validation: wherever even partial data exist, compare elicited probabilities against empirical frequencies.
Logical validation: confirm that impossible configurations receive probability zero and that deterministic relationships are encoded exactly.
| Bias | Description | Mitigation |
|---|---|---|
| Anchoring | Over-reliance on initial value | Use multiple anchors, randomize order |
| Overconfidence | Distributions too narrow | Explicitly ask for extreme scenarios |
| Availability | Weight recent/memorable cases | Use statistical framing (X out of 100) |
| Base rate neglect | Ignore prior probabilities | Provide base rates explicitly |
| Conjunction fallacy | Judging P(A and B) > P(A) | Verify probabilistic coherence |
Conditional Probability Tables are the quantitative heart of Bayesian Networks. This page covered their formal structure and normalization constraints, how to count parameters, tabular and compact representations (decision trees, rules, Noisy-OR), extensions to continuous variables, storage and numerical optimizations, deterministic nodes, and the elicitation of parameters from domain experts.
What's Next:
With structure (DAGs) and parameters (CPTs) established, the next page addresses Parameter Learning—how to estimate CPT entries from observed data. We'll explore maximum likelihood estimation, Bayesian parameter estimation, handling missing data with EM, and practical considerations for learning from real datasets.
You now understand Conditional Probability Tables as the quantitative foundation of Bayesian Networks. You can implement CPTs in multiple formats, optimize storage and computation, handle special cases like determinism and continuous variables, and elicit parameters from domain experts. Next, we'll learn how to estimate these parameters from data.