Having explored when conditional independence holds, we must now confront an uncomfortable truth: in most real-world datasets, the Naive Bayes assumption is violated.
Features in practical datasets are correlated, interacting, and dependent in complex ways. Words in sentences form phrases. Pixels in images form patterns. Genes in genomes participate in pathways. Financial indicators move together during market events.
Understanding when and how the assumption fails is essential for deciding whether Naive Bayes is a sensible choice for a given problem, for interpreting its probability estimates, and for knowing which remediation strategies to reach for.
This page provides a rigorous examination of independence violations—their sources, mathematical characterization, and practical consequences.
By the end of this page, you will understand: (1) Common sources and types of feature dependencies; (2) How to quantify the degree of independence violation; (3) The specific effects on probability estimates and classification; (4) Domains where violations are severe enough to avoid Naive Bayes; and (5) Mathematical analysis of what 'failure' actually means for the classifier.
Feature dependencies come in many forms. Understanding the different types helps predict when Naive Bayes will struggle and suggests targeted remediation strategies.
The simplest form: two features measure essentially the same underlying quantity.
Examples: a temperature reported in both Celsius and Fahrenheit, near-synonymous tokens such as 'cheap' and 'inexpensive' in a document classifier, or two sensors measuring the same signal.
Effect on Naive Bayes: Double-counting of evidence. If both features support class A, their contributions are added as if they were independent observations, leading to overconfident predictions.
Mathematical characterization: $$\rho(X_i, X_j | Y) \approx 1 \quad \text{(strong positive correlation)}$$
One feature is a superset or subset of another.
Examples: a 'city' feature alongside the 'country' that contains it, or an indicator for a specific product alongside an indicator for its broader product category.
Effect: Similar to redundancy but often partial—increases variance of probability estimates.
Features that are mutually exclusive or negatively correlated.
Examples: one-hot encoded category indicators (exactly one can be active at a time), or budget-share features that must sum to 100%.
Mathematical characterization: $$\rho(X_i, X_j | Y) \approx -1 \quad \text{(strong negative correlation)}$$
Effect: Evidence cancellation—the model under-counts evidence because it expects both features to contribute independently.
One feature is a deterministic function of another.
Examples: area alongside length and width, BMI alongside height and weight, or a log-transformed copy of an existing feature.
Effect: Maximum violation. The 'information' is counted multiple times, severely distorting probabilities.
Features are correlated because they share an unobserved common cause that isn't the class label.
Examples: several lab results all driven by an unmeasured underlying condition, or many financial indicators all responding to overall market conditions.
This is the most insidious type because it's often invisible and can't be fixed by observing more features.
| Type | Correlation Sign | Effect on NB | Severity |
|---|---|---|---|
| Direct redundancy | Strong positive | Overconfident probabilities | High |
| Hierarchical subset | Moderate positive | Increased variance | Moderate |
| Negative constraints | Negative | Underconfident/confused | Moderate |
| Functional dependency | Perfect (±1) | Extreme distortion | Very high |
| Latent variable | Varies | Systematic bias | Moderate-High |
Be aware that feature engineering often introduces dependencies. Creating polynomial features, interaction terms, or derived variables from existing features guarantees conditional dependency. Always consider the independence implications of feature engineering choices.
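As a quick illustration of this point, the sketch below (my own toy setup: uniformly distributed raw features, a simple threshold label, and scikit-learn's `PolynomialFeatures`) measures the within-class correlation between a raw feature and the squared and interaction terms derived from it:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 2000

# Two positive-valued raw features and a simple class label (demo assumptions)
X_raw = rng.uniform(0.0, 1.0, size=(n, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 1.0).astype(int)

# Add squared and interaction terms: columns become [x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_raw)

# Within-class correlation between x1 and its derived terms x1^2 and x1*x2
for c in np.unique(y):
    X_c = X_poly[y == c]
    corr = np.corrcoef(X_c.T)
    print(f"class {c}: corr(x1, x1^2) = {corr[0, 2]:.2f}, "
          f"corr(x1, x1*x2) = {corr[0, 3]:.2f}")
```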
To move from qualitative statements ('features are dependent') to actionable analysis, we need quantitative measures of how badly independence is violated.
For each class $y$, compute the correlation matrix conditioned on that class:
$$R_y = [\rho(X_i, X_j | Y = y)]_{i,j}$$
Aggregated measure: $$\bar{\rho} = \sum_y P(Y = y) \cdot \frac{1}{d(d-1)} \sum_{i \neq j} |\rho(X_i, X_j | Y = y)|$$
This gives the average absolute conditional correlation. Perfect independence: $\bar{\rho} = 0$.
For capturing non-linear dependencies:
$$I(X_i; X_j | Y) = \sum_y P(Y = y) \sum_{x_i, x_j} P(x_i, x_j | y) \log \frac{P(x_i, x_j | y)}{P(x_i | y) P(x_j | y)}$$
Properties: $I(X_i; X_j \mid Y) \geq 0$, with equality if and only if $X_i$ and $X_j$ are conditionally independent given $Y$; it is symmetric in $X_i$ and $X_j$; and unlike correlation, it captures non-linear as well as linear dependence.
Measures how far the true joint distribution is from the Naive Bayes factorization:
$$D_{KL} = \sum_y P(Y = y) \cdot D_{KL}\!\left( P(X_1, \ldots, X_d \mid Y = y) \,\Big\|\, \prod_i P(X_i \mid Y = y) \right)$$
This directly measures the 'modeling error' of the Naive Bayes factorization: in bits if base-2 logarithms are used, or in nats with natural logarithms (as in the code below).
For categorical-continuous pairs, the correlation ratio $\eta$ captures the proportion of variance explained:
$$\eta^2(X | Z) = \frac{\text{Var}(E[X | Z])}{\text{Var}(X)}$$
where $Z$ is categorical and $X$ is continuous.
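Below is a minimal sketch of this estimator; the function name `correlation_ratio` and the synthetic data are my own, but the computation follows the definition above (variance of the conditional means divided by total variance):

```python
import numpy as np

def correlation_ratio(z, x):
    """eta^2(X | Z) = Var(E[X | Z]) / Var(X) for categorical z and continuous x."""
    z = np.asarray(z)
    x = np.asarray(x, dtype=float)
    overall_var = x.var()
    if overall_var == 0:
        return 0.0
    categories = np.unique(z)
    # Frequency-weighted variance of the per-category conditional means
    cond_means = np.array([x[z == cat].mean() for cat in categories])
    weights = np.array([(z == cat).mean() for cat in categories])
    grand_mean = np.sum(weights * cond_means)
    between_var = np.sum(weights * (cond_means - grand_mean) ** 2)
    return between_var / overall_var

# Toy check: x depends strongly on the category z, so eta^2 should be large
rng = np.random.default_rng(0)
z = rng.integers(0, 3, size=1000)             # categorical feature
x = 2.0 * z + rng.normal(0, 0.5, size=1000)   # continuous feature driven by z
print(f"eta^2 = {correlation_ratio(z, x):.3f}")
```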
The following code implements these violation measures:

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.stats import spearmanr


def average_conditional_correlation(X, y):
    """
    Compute average absolute conditional correlation.
    Returns a single number summarizing independence violation.
    """
    classes = np.unique(y)
    n_features = X.shape[1]
    total_corr = 0.0
    total_pairs = 0.0

    for c in classes:
        X_c = X[y == c]
        class_weight = len(X_c) / len(X)

        if len(X_c) < 3:  # Need at least 3 samples
            continue

        # Compute correlation matrix for this class
        corr_matrix = np.corrcoef(X_c.T)

        # Sum absolute off-diagonal elements
        for i in range(n_features):
            for j in range(i + 1, n_features):
                if np.isfinite(corr_matrix[i, j]):
                    total_corr += class_weight * abs(corr_matrix[i, j])
                    total_pairs += class_weight

    return total_corr / total_pairs if total_pairs > 0 else 0.0


def pairwise_conditional_mi(X, y, n_bins=10):
    """
    Compute pairwise conditional mutual information.
    Returns a matrix of I(X_i; X_j | Y) values.
    """
    X = np.array(X)
    classes = np.unique(y)
    n_features = X.shape[1]

    # Discretize continuous features
    X_binned = np.zeros_like(X, dtype=int)
    for i in range(n_features):
        X_binned[:, i] = np.digitize(
            X[:, i],
            bins=np.linspace(X[:, i].min(), X[:, i].max(), n_bins)
        )

    # Compute CMI for each pair
    cmi_matrix = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            cmi = 0.0
            for c in classes:
                mask = y == c
                class_weight = mask.sum() / len(y)
                X_i_c = X_binned[mask, i]
                X_j_c = X_binned[mask, j]
                # MI(X_i; X_j | Y=c)
                mi_c = mutual_info_score(X_i_c, X_j_c)
                cmi += class_weight * mi_c
            cmi_matrix[i, j] = cmi
            cmi_matrix[j, i] = cmi

    return cmi_matrix


def kl_divergence_from_independence(X, y, n_bins=5):
    """
    Estimate KL divergence between true joint and NB factorization.
    Uses histogram-based density estimation (crude but illustrative).
    """
    classes = np.unique(y)
    n_features = X.shape[1]
    total_kl = 0.0

    for c in classes:
        X_c = X[y == c]
        class_weight = len(X_c) / len(X)

        if n_features > 3:
            # Can only do full joint estimation for small feature sets
            print(f"Warning: {n_features} features too many for exact joint estimation")
            return None

        # Discretize
        X_binned = np.zeros_like(X_c, dtype=int)
        for i in range(n_features):
            X_binned[:, i] = np.digitize(
                X_c[:, i],
                bins=np.linspace(X_c[:, i].min() - 1e-10,
                                 X_c[:, i].max() + 1e-10, n_bins + 1)
            )

        # Estimate joint distribution (empirical)
        joint_counts = {}
        for row in X_binned:
            key = tuple(row)
            joint_counts[key] = joint_counts.get(key, 0) + 1
        n_samples = len(X_c)
        joint_probs = {k: v / n_samples for k, v in joint_counts.items()}

        # Estimate marginals
        marginal_probs = []
        for i in range(n_features):
            unique, counts = np.unique(X_binned[:, i], return_counts=True)
            marginal_probs.append({u: cnt / n_samples for u, cnt in zip(unique, counts)})

        # Compute KL: sum P(x) log [P(x) / Q(x)] where Q is the factorized distribution
        kl = 0.0
        for key, p in joint_probs.items():
            q = 1.0
            for i, val in enumerate(key):
                q *= marginal_probs[i].get(val, 1e-10)
            if p > 0 and q > 0:
                kl += p * np.log(p / q)

        total_kl += class_weight * kl

    return total_kl


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)
    n = 500

    # Scenario 1: Independent features (NB should work well)
    X_indep = np.random.randn(n, 3)
    y_indep = (X_indep[:, 0] + X_indep[:, 1] > 0).astype(int)
    print("=== Independent Features ===")
    print(f"Avg conditional correlation: {average_conditional_correlation(X_indep, y_indep):.4f}")

    # Scenario 2: Highly correlated features (NB should struggle)
    X1 = np.random.randn(n)
    X2 = 0.9 * X1 + 0.44 * np.random.randn(n)
    X3 = 0.9 * X1 + 0.44 * np.random.randn(n)
    X_corr = np.column_stack([X1, X2, X3])
    y_corr = (X1 > 0).astype(int)
    print("\n=== Highly Correlated Features ===")
    print(f"Avg conditional correlation: {average_conditional_correlation(X_corr, y_corr):.4f}")

    # Scenario 3: Functional dependency
    X1 = np.random.randn(n)
    X2 = X1 ** 2  # Deterministic function
    X3 = np.random.randn(n)
    X_func = np.column_stack([X1, X2, X3])
    y_func = (X1 + X3 > 0).astype(int)
    print("\n=== Functional Dependency (X2 = X1^2) ===")
    cmi = pairwise_conditional_mi(X_func, y_func)
    print(f"CMI matrix:\n{np.round(cmi, 3)}")
```

Rough interpretation guidelines: conditional correlation below 0.1 means independence is reasonable; 0.1 to 0.3 indicates minor violations where NB likely still works; 0.3 to 0.5 indicates moderate violations with possible performance degradation; above 0.5 indicates significant violations, so consider alternatives. These are rough guidelines; domain and sample size matter.
When conditional independence is violated, Naive Bayes produces biased probability estimates. Understanding these biases is crucial for interpretation and calibration.
Consider two perfectly correlated features $X_1$ and $X_2$ (both give identical information about class $Y$).
True probability: $$P(Y = 1 | X_1 = x, X_2 = x) = P(Y = 1 | X_1 = x) = P(Y = 1 | X_2 = x)$$
Naive Bayes estimate: $$P_{NB}(Y = 1 | X_1 = x, X_2 = x) \propto P(X_1 = x | Y = 1) \cdot P(X_2 = x | Y = 1) \cdot P(Y = 1)$$
Since both likelihoods contain the same information, NB effectively squares the evidence, pushing the posterior toward 0 or 1 and producing systematically overconfident predictions.
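A minimal numeric sketch of this double-counting (GaussianNB on a synthetic one-signal dataset of my own construction): train once on a single informative feature and once on that feature plus an exact copy, then compare the predicted probability at the same test point.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 5000

# One informative feature; the class is a noisy function of it
x = rng.normal(0, 1, n)
y = (x + rng.normal(0, 1, n) > 0).astype(int)

X_single = x.reshape(-1, 1)            # just the feature
X_doubled = np.column_stack([x, x])    # the feature plus an exact copy

p_single = GaussianNB().fit(X_single, y).predict_proba([[1.5]])[0, 1]
p_doubled = GaussianNB().fit(X_doubled, y).predict_proba([[1.5, 1.5]])[0, 1]

# The duplicated feature is counted twice, pushing the posterior further toward 1
print(f"P(y=1 | x=1.5), single copy: {p_single:.3f}")
print(f"P(y=1 | x=1.5), two copies : {p_doubled:.3f}")
```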
Let's analyze the general case. Define the true likelihood ratio:
$$\text{LR}_{\text{true}}(\mathbf{x}) = \frac{P(\mathbf{x} | Y = 1)}{P(\mathbf{x} | Y = 0)}$$
And the Naive Bayes likelihood ratio:
$$\text{LR}_{NB}(\mathbf{x}) = \prod_i \frac{P(x_i | Y = 1)}{P(x_i | Y = 0)}$$
The relationship depends on the dependency structure:
Positive conditional correlation (features tend to agree): $$\text{LR}_{NB}(\mathbf{x}) > \text{LR}_{\text{true}}(\mathbf{x}) \text{ when } \mathbf{x} \text{ supports one class}$$
Result: Overconfident predictions.
Negative conditional correlation (features tend to disagree): $$\text{LR}_{NB}(\mathbf{x}) < \text{LR}_{\text{true}}(\mathbf{x}) \text{ for extreme } \mathbf{x}$$
Result: Underconfident predictions.
The signature of independence violations appears in calibration curves (reliability diagrams): predicted probabilities pile up near 0 and 1 while the corresponding empirical frequencies stay more moderate, producing the characteristic overconfident S-shape.
Even when independence is violated, NB's probability rankings are often correct—what's wrong is the scale. Post-hoc calibration methods (Platt scaling, isotonic regression, temperature scaling) can fix the probability scale while preserving ranking quality. This is often simpler than switching models.
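As a sketch of this repair (correlated synthetic data assumed purely for illustration), the snippet below wraps GaussianNB in scikit-learn's `CalibratedClassifierCV` with isotonic regression and compares Brier scores before and after:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
n = 4000

# Three noisy copies of the same signal: strong conditional correlation
signal = rng.normal(0, 1, n)
X = np.column_stack([signal + 0.3 * rng.normal(0, 1, n) for _ in range(3)])
y = (signal + rng.normal(0, 1, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_tr, y_tr)

# Lower Brier score = better calibrated probabilities
for name, model in [("raw NB", raw), ("isotonic-calibrated NB", calibrated)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: Brier = {brier_score_loss(y_te, p):.3f}, "
          f"accuracy = {model.score(X_te, y_te):.3f}")
```

Accuracy barely moves because isotonic calibration is a monotone transformation of the scores; only the probability scale changes.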
Probability distortion doesn't always translate to classification errors. Understanding when violations hurt accuracy (and when they don't) is key.
1. Monotonic distortion preserves rankings:
Classification depends only on which class has the highest posterior. If the true ranking is: $$P(Y = 1 | \mathbf{x}) > P(Y = 0 | \mathbf{x})$$
And the NB estimate is: $$P_{NB}(Y = 1 | \mathbf{x}) > P_{NB}(Y = 0 | \mathbf{x})$$
Then the classification is correct even if the probabilities are wrong.
Key insight: Dependencies that uniformly inflate or deflate all class probabilities don't change the argmax.
2. Symmetric dependencies cancel:
If feature pair $(X_1, X_2)$ has +0.4 correlation and $(X_3, X_4)$ has -0.4 correlation, their effects on the posterior may partially cancel.
3. Weak signal features:
If correlated features have low discriminative power, their double-counting has minimal impact on the decision boundary.
1. Asymmetric dependencies across classes:
If features are positively correlated in class 1 but negatively correlated in class 0, the distortion is asymmetric. This shifts the decision boundary incorrectly.
2. Strong signal features with correlation:
When highly predictive features are correlated, double-counting can overwhelm other evidence, causing misclassification of borderline cases.
3. Small sample sizes:
With limited data, the variance increase from dependency-induced parameter coupling degrades generalization.
4. Multi-class with varied correlations:
Different correlation structures across multiple classes cause relative distortions that break rankings.
Naive Bayes often produces 'wrong' probability estimates that lead to 'correct' classifications. This is because classification only requires getting the ordering right, not the exact probabilities. A model that says P(spam) = 0.99999 when the truth is P(spam) = 0.8 still correctly classifies the email as spam.
Having understood the mechanics of failure, let's identify domains where independence violations are severe enough to make Naive Bayes a poor choice.
Classifying images using pixel values as features is a disaster for Naive Bayes.
Why it fails: neighboring pixels are strongly correlated within every class, and class identity lives in spatial patterns (edges, textures, shapes) rather than in any individual pixel value.
Quantification: Within-class correlation between adjacent pixels is often 0.9+.
What works instead: CNNs, which explicitly model spatial structure.
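As a rough, dataset-dependent sanity check of the adjacent-pixel claim, the sketch below uses scikit-learn's built-in 8x8 digits dataset and one arbitrarily chosen interior pixel; the exact value varies with dataset, resolution, and pixel pair.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target  # 8x8 images flattened to 64 features

# Within-class correlation between an interior pixel and its right-hand neighbour
row, col = 4, 3
i, j = row * 8 + col, row * 8 + col + 1
corrs = []
for c in np.unique(y):
    X_c = X[y == c]
    if X_c[:, i].std() > 0 and X_c[:, j].std() > 0:  # skip constant pixels
        corrs.append(np.corrcoef(X_c[:, i], X_c[:, j])[0, 1])
print(f"Mean within-class correlation of adjacent pixels ({i}, {j}): {np.mean(corrs):.2f}")
```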
While Naive Bayes works for bag-of-words classification, it fails for tasks requiring word order.
Why it fails: the bag-of-words representation discards word order, so negation ('not good' vs. 'good'), phrases, and syntax are invisible to per-word likelihoods; the sequential dependencies between words are precisely the signal these tasks need.
What works instead: RNNs, Transformers, which model sequential dependencies.
Predicting future values from past observations violates independence fundamentally.
Why it fails: consecutive observations are autocorrelated by construction, so lagged features are far from conditionally independent; the temporal structure is the signal, and the factorized model throws it away.
What works instead: ARIMA, LSTMs, Temporal Convolutional Networks.
Problems where the target depends on feature combinations, not individual features.
Examples: XOR-style logical rules, epistatic gene pairs, drug combinations whose effect appears only when the drugs are taken together.
Why it fails: each feature can be completely uninformative on its own, with all of the predictive power residing in combinations; this is exactly the interaction information (see the XOR analysis below) that a factorized model cannot represent.
Datasets where features are measurements of the same underlying quantity.
Examples: multiple sensors or instruments recording the same physical quantity, repeated survey items probing the same trait, or technical indicators all derived from the same price series.
| Domain | Dependency Type | Better Alternatives |
|---|---|---|
| Image classification (pixels) | Spatial structure | CNN, Vision Transformers |
| Sequence labeling | Sequential structure | RNN, LSTM, CRF, Transformers |
| Time series | Temporal autocorrelation | ARIMA, Temporal models |
| XOR-type problems | Pure interactions | Neural networks, Decision trees |
| Highly redundant features | Redundancy | Regularized models, Feature selection first |
If your data has inherent structure (spatial, temporal, hierarchical), Naive Bayes discards that structure. You're not just accepting minor probability distortions—you're throwing away the most informative part of your data. Use models designed for structured data instead.
The XOR problem is the canonical example of where Naive Bayes fundamentally cannot work. Understanding why illuminates the deep limitations of the independence assumption.
Setup: two binary features $X_1, X_2 \in \{0, 1\}$, each input combination equally likely, with label $Y = X_1 \oplus X_2$:
| $X_1$ | $X_2$ | $Y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Marginal distributions:
$$P(X_1 = 1 | Y = 1) = \frac{1}{2}, \quad P(X_1 = 0 | Y = 1) = \frac{1}{2}$$ $$P(X_2 = 1 | Y = 1) = \frac{1}{2}, \quad P(X_2 = 0 | Y = 1) = \frac{1}{2}$$
And identically for $Y = 0$!
Conclusion: Neither $X_1$ nor $X_2$ individually provides any information about $Y$. Their marginal distributions are identical regardless of $Y$.
Naive Bayes computes: $$P_{NB}(Y = 1 | X_1, X_2) \propto P(X_1 | Y = 1) \cdot P(X_2 | Y = 1) \cdot P(Y = 1) = \frac{1}{4} \cdot \frac{1}{2} = \frac{1}{8}$$
The same value for all combinations! Naive Bayes predicts $P(Y = 1) = 0.5$ for every input.
Accuracy: 50% (random guessing)
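This is easy to confirm empirically; here is a minimal sketch with scikit-learn's `BernoulliNB` on replicated XOR rows (the replication count is arbitrary):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# The four XOR rows, replicated so each appears many times
X = np.tile(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), (250, 1))
y = X[:, 0] ^ X[:, 1]  # Y = X1 XOR X2

nb = BernoulliNB().fit(X, y)
print("Predicted P(Y=1) for each input:")
for row in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    print(f"  {row}: {nb.predict_proba([row])[0, 1]:.3f}")  # ~0.5 everywhere
print(f"Training accuracy: {nb.score(X, y):.2f}")           # ~0.5, i.e. chance level
```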
The XOR function has zero mutual information between any individual feature and the target:
$$I(X_1; Y) = 0, \quad I(X_2; Y) = 0$$
But the joint has full information:
$$I(X_1, X_2; Y) = 1 \text{ bit}$$
This is the interaction information—information present in the combination that's absent from the parts.
$$I(X_1; X_2; Y) = I(X_1, X_2; Y) - I(X_1; Y) - I(X_2; Y) = 1 - 0 - 0 = 1 \text{ bit}$$
Naive Bayes can only use $I(X_1; Y) + I(X_2; Y)$—it completely misses the interaction information.
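These quantities can be checked numerically. In the sketch below, `mutual_info_score` returns values in nats, so they are converted to bits; the encoding `2*X1 + X2` is simply a way to treat the feature pair as a single variable.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Exhaustive, balanced XOR data
X1 = np.array([0, 0, 1, 1])
X2 = np.array([0, 1, 0, 1])
Y = X1 ^ X2

to_bits = 1 / np.log(2)  # mutual_info_score returns nats

i_x1_y = mutual_info_score(X1, Y) * to_bits
i_x2_y = mutual_info_score(X2, Y) * to_bits
i_joint_y = mutual_info_score(2 * X1 + X2, Y) * to_bits  # joint (X1, X2) as one variable

print(f"I(X1; Y)     = {i_x1_y:.3f} bits")    # 0
print(f"I(X2; Y)     = {i_x2_y:.3f} bits")    # 0
print(f"I(X1, X2; Y) = {i_joint_y:.3f} bits")  # 1
```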
Naive Bayes cannot represent any function where the predictive power lies entirely in feature interactions. This isn't a matter of sample size or feature engineering—it's a representational limitation. The model class simply doesn't include XOR-type functions.
Pure XOR is rare, but XOR-like structures appear in:
1. Genetics: Epistasis where two genes together produce an effect neither produces alone.
2. Chemistry: Catalysts that enable reactions between molecules that don't react individually.
3. Security: Authentication requiring multiple factors (password AND biometric).
4. Logic: Any system designed around combinatorial rules.
Detection: look for features that are individually uninformative (near-zero marginal mutual information with the target) yet jointly predictive, or for a large performance gap between Naive Bayes and interaction-capturing models such as decision trees or boosted ensembles.
Mitigation: engineer explicit interaction features (products, logical combinations) so the relevant combination becomes a single feature the factorized model can see, or switch to a model class that represents interactions natively; a minimal sketch of the first option follows.
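Below, the interaction-feature fix on the XOR toy problem. The right interaction is known in advance here, which is exactly what makes it a toy; in practice you would generate candidate products or logical combinations and let validation decide which to keep.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]

# Plain NB on the raw features: stuck at chance level
print(f"NB on (X1, X2):            {BernoulliNB().fit(X, y).score(X, y):.2f}")

# Add an explicit interaction feature; NB can now use it like any other feature
X_aug = np.column_stack([X, X[:, 0] ^ X[:, 1]])
print(f"NB on (X1, X2, X1 xor X2): {BernoulliNB().fit(X_aug, y).score(X_aug, y):.2f}")
```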
In practice, you need efficient methods to detect whether independence violations in your specific dataset will hurt Naive Bayes performance.
The most pragmatic approach: train Naive Bayes alongside models that can exploit feature dependencies (e.g., logistic regression, gradient boosting) and compare cross-validated performance, as in the diagnostic code at the end of this section.
Interpretation: a small gap suggests the violations are benign for this task; a large gap means the dependency structure carries signal that Naive Bayes is discarding.
Visual inspection of the conditional correlation matrix: blocks of large off-diagonal entries reveal clusters of mutually dependent features, and differences between the per-class matrices point to asymmetric dependencies, the kind most likely to shift the decision boundary.
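A sketch of such a heatmap (matplotlib; the two-class synthetic data with one correlated pair is a placeholder for your own dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 1000
base = rng.normal(0, 1, n)
X = np.column_stack([base,
                     0.8 * base + 0.6 * rng.normal(0, 1, n),  # correlated with feature 0
                     rng.normal(0, 1, n),
                     rng.normal(0, 1, n)])
y = (base + rng.normal(0, 1, n) > 0).astype(int)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, c in zip(axes, np.unique(y)):
    corr = np.corrcoef(X[y == c].T)  # conditional correlation matrix for class c
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_title(f"Class {c}")
fig.colorbar(im, ax=axes.ravel().tolist(), label="conditional correlation")
plt.show()
```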
Feature ablation: systematically remove features and observe the impact on cross-validated performance; if accuracy barely changes (or improves) when a feature is dropped, its information was largely redundant with the remaining features.
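A minimal ablation sketch (GaussianNB with 5-fold cross-validation; the synthetic data, which contains one near-duplicate of the signal feature, is an assumption for the demo):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(0, 1, n)
X = np.column_stack([signal,
                     signal + 0.1 * rng.normal(0, 1, n),  # near-duplicate of feature 0
                     rng.normal(0, 1, n)])
y = (signal + 0.5 * rng.normal(0, 1, n) > 0).astype(int)

full_score = cross_val_score(GaussianNB(), X, y, cv=5).mean()
print(f"All features: {full_score:.3f}")
for drop in range(X.shape[1]):
    X_abl = np.delete(X, drop, axis=1)
    score = cross_val_score(GaussianNB(), X_abl, y, cv=5).mean()
    # Little or no drop when removing the near-duplicate suggests redundancy
    print(f"Without feature {drop}: {score:.3f} (change {score - full_score:+.3f})")
```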
Calibration analysis: directly measure probability quality with a reliability diagram; systematic overconfidence is the characteristic footprint of positively correlated features (see check_calibration in the code below).
The following diagnostic toolkit puts these checks together:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt


def performance_gap_analysis(X, y, cv=5):
    """
    Compare NB to dependency-capturing models.
    Returns performance gaps as indicators of independence violations.
    """
    models = {
        'Naive Bayes': GaussianNB(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100)
    }

    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
        results[name] = {
            'mean': scores.mean(),
            'std': scores.std()
        }

    nb_score = results['Naive Bayes']['mean']
    gaps = {name: res['mean'] - nb_score
            for name, res in results.items() if name != 'Naive Bayes'}

    print("=== Performance Gap Analysis ===")
    for name, score in results.items():
        print(f"{name}: {score['mean']:.3f} ± {score['std']:.3f}")
    print(f"\nGap to LR: {gaps['Logistic Regression']:.3f}")
    print(f"Gap to GB: {gaps['Gradient Boosting']:.3f}")

    max_gap = max(gaps.values())
    if max_gap < 0.02:
        print("\nConclusion: Independence violations appear benign")
    elif max_gap < 0.05:
        print("\nConclusion: Moderate violations - NB may be acceptable")
    else:
        print("\nConclusion: Significant violations - consider alternatives")

    return results, gaps


def check_calibration(X, y, n_bins=10):
    """
    Assess calibration quality as indirect evidence of violations.
    """
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    nb = GaussianNB()
    nb.fit(X_train, y_train)

    # Get predicted probabilities
    probs = nb.predict_proba(X_test)[:, 1]

    # Compute calibration curve
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_test, probs, n_bins=n_bins
    )

    # Compute calibration error
    calibration_error = np.mean(np.abs(fraction_of_positives - mean_predicted_value))

    print("\n=== Calibration Analysis ===")
    print(f"Mean Calibration Error: {calibration_error:.3f}")
    if calibration_error < 0.05:
        print("Calibration is good - independence assumption appears valid")
    elif calibration_error < 0.1:
        print("Moderate calibration error - some independence violations likely")
    else:
        print("Poor calibration - significant independence violations")

    return fraction_of_positives, mean_predicted_value, calibration_error


def redundancy_check(X, y, threshold=0.8):
    """
    Identify potentially redundant feature pairs.
    """
    classes = np.unique(y)
    n_features = X.shape[1]
    redundant_pairs = []

    for c in classes:
        X_c = X[y == c]
        corr = np.corrcoef(X_c.T)
        for i in range(n_features):
            for j in range(i + 1, n_features):
                if abs(corr[i, j]) > threshold:
                    redundant_pairs.append((i, j, c, corr[i, j]))

    if redundant_pairs:
        print(f"\n=== Redundant Feature Pairs (|ρ| > {threshold}) ===")
        for i, j, c, rho in redundant_pairs:
            print(f"  Features ({i}, {j}) in class {c}: ρ = {rho:.3f}")
        print("\nConsider removing one feature from each pair.")
    else:
        print(f"\nNo highly redundant pairs found (threshold = {threshold})")

    return redundant_pairs


# Demo
if __name__ == "__main__":
    np.random.seed(42)

    # Create dataset with moderate correlation
    n = 1000
    X1 = np.random.randn(n)
    X2 = 0.7 * X1 + 0.71 * np.random.randn(n)  # Correlated with X1
    X3 = np.random.randn(n)
    X4 = np.random.randn(n)
    X = np.column_stack([X1, X2, X3, X4])
    y = (0.5 * X1 + 0.3 * X3 + 0.2 * X4 > 0).astype(int)

    # Run diagnostics
    performance_gap_analysis(X, y)
    check_calibration(X, y)
    redundancy_check(X, y, threshold=0.6)
```

We've thoroughly examined when and how the conditional independence assumption fails. Key insights: dependencies come in several types (redundancy, hierarchy, negative constraints, functional and latent-variable dependence) with very different severities; violations can be quantified with conditional correlation, conditional mutual information, and KL divergence; the primary symptom is distorted, usually overconfident, probabilities; classification accuracy often survives because only the ranking of posteriors matters; and some domains (pixels, sequences, time series, pure interactions) are structurally poor fits for Naive Bayes.
What's next:
Despite all these potential failure modes, Naive Bayes often works surprisingly well—even when the assumption is clearly violated. The next page explores why this paradox exists and the theoretical foundations that explain NB's robust performance.
You now understand when and why the Naive Bayes assumption fails, how to quantify violations, and which domains to avoid. This knowledge helps you use NB appropriately and diagnose issues when they arise. Next, we'll explore the fascinating question of why NB works well despite frequent assumption violations.