Imagine two experts examining the same patient—a radiologist reading X-rays and a pathologist analyzing blood tests. Each sees different aspects of the same underlying condition. When their independent assessments agree, confidence is high. When they disagree, it signals uncertainty worth investigating.
This intuition drives co-training, a powerful semi-supervised learning paradigm introduced by Blum and Mitchell (1998). Instead of a single classifier teaching itself (as in self-training), co-training uses two classifiers on different feature views that teach each other. This multi-view approach naturally breaks the confirmation bias that plagues single-model methods.
By the end of this page, you will understand the co-training algorithm, the critical multi-view assumption, theoretical conditions for success, view construction strategies when natural views don't exist, and modern extensions like multi-view and democratic co-training.
Co-training assumes features can be split into two conditionally independent views $X = (X^{(1)}, X^{(2)})$, each sufficient to predict the label. The algorithm trains separate classifiers on each view and has them teach each other.
Formal Setup: Given a small labeled set $D_L = \{(x_i^{(1)}, x_i^{(2)}, y_i)\}$ and a large unlabeled set $D_U = \{(x_j^{(1)}, x_j^{(2)})\}$, train a classifier $f_1$ on view 1 and $f_2$ on view 2, then iterate: each classifier pseudo-labels its most confident unlabeled examples and adds them to the labeled pool, so each classifier is retrained on examples labeled by the other.
```python
import numpy as np


def co_training(D_L, D_U, classifier1, classifier2,
                max_iterations=50, p=1, n=3, pool_size=75):
    """
    Classic Co-Training Algorithm (Blum & Mitchell, 1998).

    Args:
        D_L: Labeled data [(x1, x2, y), ...]
        D_U: Unlabeled data [(x1, x2), ...]
        classifier1, classifier2: Classifiers for each view
        p: Positive samples to add per classifier per iteration
        n: Negative samples to add per classifier per iteration
        pool_size: Size of the unlabeled pool sampled each iteration
    """
    X1_L = [d[0] for d in D_L]  # View 1 features
    X2_L = [d[1] for d in D_L]  # View 2 features
    y_L = [d[2] for d in D_L]
    U = list(D_U)

    f1, f2 = None, None
    for iteration in range(max_iterations):
        # Train both classifiers on the current labeled set
        f1 = classifier1.fit(X1_L, y_L)
        f2 = classifier2.fit(X2_L, y_L)

        if len(U) == 0:
            break

        # Sample a pool from the unlabeled data
        pool_indices = np.random.choice(len(U), min(pool_size, len(U)), replace=False)
        pool = [U[i] for i in pool_indices]

        # Classifier 1 labels samples for classifier 2, and vice versa
        probs1 = f1.predict_proba([x[0] for x in pool])
        probs2 = f2.predict_proba([x[1] for x in pool])

        # Select the most confident predictions from each classifier
        added = []
        for clf_probs in (probs1, probs2):
            for class_label in [0, 1]:  # Binary classification
                n_add = p if class_label == 1 else n
                class_probs = clf_probs[:, class_label]
                top_indices = np.argsort(class_probs)[-n_add:]
                for idx in top_indices:
                    if clf_probs[idx, class_label] > 0.5:  # Basic confidence threshold
                        added.append((pool_indices[idx], class_label))

        # Add pseudo-labeled samples to the labeled set
        for idx, label in added:
            X1_L.append(U[idx][0])
            X2_L.append(U[idx][1])
            y_L.append(label)

        # Remove the pseudo-labeled samples from the unlabeled set
        for idx in sorted(set(i for i, _ in added), reverse=True):
            U.pop(idx)

        print(f"Iter {iteration}: Added {len(added)}, Labeled: {len(y_L)}")

    return f1, f2
```
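A minimal usage sketch of the `co_training` function above, on synthetic two-view data with logistic regression as both base classifiers. The data generation and all hyperparameters here are illustrative, not part of the original algorithm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: two 5-dimensional views whose means shift with the label
rng = np.random.default_rng(0)
y_all = rng.integers(0, 2, size=500)
view1 = rng.normal(loc=y_all[:, None] * 1.5, scale=1.0, size=(500, 5))
view2 = rng.normal(loc=y_all[:, None] * 1.5, scale=1.0, size=(500, 5))

# Keep only 20 labeled examples; treat the rest as unlabeled
D_L = [(view1[i], view2[i], y_all[i]) for i in range(20)]
D_U = [(view1[i], view2[i]) for i in range(20, 500)]

f1, f2 = co_training(
    D_L, D_U,
    classifier1=LogisticRegression(max_iter=1000),
    classifier2=LogisticRegression(max_iter=1000),
    max_iterations=30,
)

# One simple way to combine the two views at prediction time: average their probabilities
p = (f1.predict_proba(view1)[:, 1] + f2.predict_proba(view2)[:, 1]) / 2
print("Accuracy on all 500 samples:", np.mean((p > 0.5) == y_all))
```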
Key Insight: Mutual Confirmation
The power of co-training comes from cross-view validation. When classifier 1 confidently predicts a label, that sample is added to classifier 2's training set. Since the views are independent, classifier 2's features provide an independent check—if the label were wrong, classifier 2 would likely struggle to fit it, limiting error propagation.
Co-training's theoretical guarantees rest on two critical assumptions:
1. Sufficiency: Each view is independently sufficient to determine the label: $$P(Y | X^{(1)}) = P(Y | X) \quad \text{and} \quad P(Y | X^{(2)}) = P(Y | X)$$
2. Conditional Independence: Given the label, the views are independent: $$P(X^{(1)}, X^{(2)} | Y) = P(X^{(1)} | Y) \cdot P(X^{(2)} | Y)$$
These assumptions are strong but enable powerful theoretical results.
| Domain | View 1 | View 2 |
|---|---|---|
| Web Page Classification | Page text content | Anchor text of incoming links |
| Video Classification | Visual frames | Audio/speech track |
| Named Entity Recognition | Word features | Context window features |
| Medical Diagnosis | Imaging (X-ray, MRI) | Lab test results |
| Document Classification | Title + abstract | Full body text |
In practice, perfect conditional independence rarely holds. Views are often correlated—page text and anchor text both mention product names. Relaxed versions of co-training can still work with 'approximately independent' views, but performance degrades as correlation increases.
Blum and Mitchell (1998) proved that under the multi-view assumptions, co-training can learn with very few labeled examples.
PAC-Learning Guarantee:
If views are sufficient and conditionally independent, and the initial classifier on each view has error $\epsilon < 0.5$, then co-training achieves arbitrarily low error using only $O(\log(1/\epsilon))$ labeled examples.
Contraction Lemma:
Define the compatibility of two classifiers as the probability that they agree on a random unlabeled example. Dasgupta et al. (2002) showed that co-training works by making the classifiers progressively more compatible, i.e., by reducing their disagreement on unlabeled data:
$$\text{disagreement}(f_1^{(t+1)}, f_2^{(t+1)}) \leq \text{disagreement}(f_1^{(t)}, f_2^{(t)})$$
As disagreement decreases, both classifiers converge toward the true decision boundary.
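A small sketch of how this can be monitored in practice. Assuming `f1` and `f2` are the current view classifiers and `U` holds the remaining unlabeled `(x1, x2)` pairs, as in the algorithm above, the disagreement rate should trend downward across co-training iterations:

```python
import numpy as np

def disagreement_rate(f1, f2, U):
    """Fraction of unlabeled samples on which the two view-specific classifiers disagree."""
    if len(U) == 0:
        return 0.0
    pred1 = f1.predict([x[0] for x in U])  # Predictions from view 1
    pred2 = f2.predict([x[1] for x in U])  # Predictions from view 2
    return float(np.mean(pred1 != pred2))
```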
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def analyze_view_quality(X1, X2, y, test_size=0.3):
    """
    Analyze whether two views satisfy the co-training assumptions.
    Returns diagnostics for sufficiency and (approximate) independence.
    Assumes binary labels.
    """
    X1_tr, X1_te, X2_tr, X2_te, y_tr, y_te = train_test_split(
        X1, X2, y, test_size=test_size, stratify=y
    )

    # Sufficiency: each view on its own should predict the label well
    clf1 = LogisticRegression().fit(X1_tr, y_tr)
    clf2 = LogisticRegression().fit(X2_tr, y_tr)
    acc1 = accuracy_score(y_te, clf1.predict(X1_te))
    acc2 = accuracy_score(y_te, clf2.predict(X2_te))

    # Independence: the views' residual errors should be uncorrelated
    res1 = clf1.predict_proba(X1_te)[:, 1] - y_te
    res2 = clf2.predict_proba(X2_te)[:, 1] - y_te
    residual_corr = np.corrcoef(res1, res2)[0, 1]

    # Prediction agreement (lower = more diverse views)
    pred1, pred2 = clf1.predict(X1_te), clf2.predict(X2_te)
    agreement = np.mean(pred1 == pred2)

    return {
        'view1_accuracy': acc1,
        'view2_accuracy': acc2,
        'residual_correlation': residual_corr,
        'prediction_agreement': agreement,
        'suitable_for_cotraining': acc1 > 0.6 and acc2 > 0.6 and abs(residual_corr) < 0.3
    }
```

Natural multi-view data is rare. When views don't exist naturally, we can construct them artificially:
1. Random Feature Split: Randomly partition features into two disjoint sets. Simple but doesn't guarantee independence.
2. PCA-Based Split: Use first $k$ principal components as view 1, remaining as view 2. Captures orthogonal variance.
3. Learned Split (Wang & Zhou, 2010): Optimize the feature split to maximize view disagreement subject to each view being predictive.
```python
import numpy as np
from sklearn.decomposition import PCA


class ViewConstructor:
    """Strategies for constructing views when natural views don't exist."""

    @staticmethod
    def random_split(X, split_ratio=0.5, seed=42):
        """Randomly partition features into two disjoint views."""
        np.random.seed(seed)
        n_features = X.shape[1]
        n_view1 = int(n_features * split_ratio)
        indices = np.random.permutation(n_features)
        view1_idx = indices[:n_view1]
        view2_idx = indices[n_view1:]
        return X[:, view1_idx], X[:, view2_idx]

    @staticmethod
    def pca_split(X, n_components_v1=None):
        """Split using PCA: leading components vs. the reconstruction residual."""
        if n_components_v1 is None:
            n_components_v1 = X.shape[1] // 2
        pca = PCA(n_components=n_components_v1)
        view1 = pca.fit_transform(X)
        # Residual after projecting onto the leading components
        reconstructed = pca.inverse_transform(view1)
        view2 = X - reconstructed
        return view1, view2

    @staticmethod
    def domain_knowledge_split(X, feature_groups):
        """Split based on semantic feature groups.

        Args:
            feature_groups: Dict mapping group names to feature indices,
                e.g., {'text': [0, 1, 2], 'metadata': [3, 4, 5]}
        """
        groups = list(feature_groups.values())
        view1 = X[:, groups[0]]
        view2 = X[:, np.concatenate(groups[1:])]
        return view1, view2
```

Domain knowledge splits outperform random splits. If your data has semantic groups (e.g., text features vs. numerical, image vs. metadata), use those natural boundaries. The goal is views that make independent errors; random splits often share correlated errors.
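A short sketch tying the two helpers above together: build candidate views from a single feature matrix, then sanity-check them with `analyze_view_quality` before running co-training. The dataset here is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification

# Illustrative single-view dataset with 20 features
X, y = make_classification(n_samples=400, n_features=20, n_informative=10, random_state=0)

candidates = {
    "random": ViewConstructor.random_split(X),
    "pca": ViewConstructor.pca_split(X),
}
for name, (v1, v2) in candidates.items():
    report = analyze_view_quality(v1, v2, y)
    print(name, report)
# Prefer the split where both views are individually predictive and
# the residual correlation between their errors is low.
```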
Research has extended co-training in several directions:
Multi-View Learning: Extend to $k > 2$ views. Each classifier's predictions are validated by majority vote of others: $$\tilde{y}_j = \text{majority}(f_1(x_j^{(1)}), ..., f_k(x_j^{(k)}))$$
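A minimal sketch of that voting step, assuming non-negative integer class labels and a list of $k$ classifiers, each already trained on its own view:

```python
import numpy as np

def majority_vote_labels(classifiers, views):
    """Pseudo-label a batch by majority vote over k view-specific classifiers.
    classifiers[i] was trained on views[i], an array of shape (n_samples, d_i)."""
    preds = np.stack([clf.predict(v) for clf, v in zip(classifiers, views)])  # (k, n_samples)
    # Per-sample majority; requires non-negative integer labels, ties go to the smaller label
    return np.array([np.bincount(col).argmax() for col in preds.T])
```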
Democratic Co-Training (Zhou & Li, 2005): Use diverse classifiers on the same features rather than different views. Diversity comes from algorithm differences (e.g., SVM, Random Forest, Neural Net).
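A rough sketch of the core idea only, not the full published procedure (which also weights learners by estimated confidence): train diverse models on the same features and pseudo-label only the points on which they agree. All names here are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def democratic_round(X_L, y_L, X_U):
    """One simplified round: pseudo-label the unlabeled points on which all diverse learners agree.
    X_L, X_U: arrays of shape (n, d); y_L: integer labels."""
    learners = [
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=100, random_state=0),
        SVC(),  # Hard predictions are enough for an agreement check
    ]
    preds = np.stack([m.fit(X_L, y_L).predict(X_U) for m in learners])  # (3, n_unlabeled)
    consensus = (preds[0] == preds[1]) & (preds[1] == preds[2])
    # Candidate pseudo-labeled examples to merge into the next round's training set
    return X_U[consensus], preds[0][consensus]
```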
Co-Regularization: Instead of pseudo-labeling, add a regularization term encouraging classifiers to agree: $$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 + \lambda \sum_{x \in U} |f_1(x) - f_2(x)|^2$$
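A sketch of this objective for two binary probabilistic classifiers, using cross-entropy for the supervised terms; the function and variable names are illustrative:

```python
import numpy as np

def co_regularization_loss(p1_l, p2_l, y, p1_u, p2_u, lam=1.0):
    """L = L_1 + L_2 + lambda * sum over unlabeled x of (f_1(x) - f_2(x))^2.
    p1_l, p2_l: each view's predicted P(y=1|x) on labeled data; p1_u, p2_u: same on unlabeled data."""
    eps = 1e-12
    ce1 = -np.mean(y * np.log(p1_l + eps) + (1 - y) * np.log(1 - p1_l + eps))
    ce2 = -np.mean(y * np.log(p2_l + eps) + (1 - y) * np.log(1 - p2_l + eps))
    agreement_penalty = np.sum((p1_u - p2_u) ** 2)  # Encourages the two views to agree on U
    return ce1 + ce2 + lam * agreement_penalty
```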
| Variant | Diversity Source | Best When |
|---|---|---|
| Classic Co-Training | Feature splits | Natural views exist |
| Democratic Co-Training | Algorithm diversity | Single feature set, multiple models |
| Multi-View | Multiple views (>2) | Rich multi-modal data |
| Co-Regularization | Soft agreement constraint | Want smooth optimization |
When to Choose Each:
| Scenario | Recommendation |
|---|---|
| Natural multi-view data (web, multimodal) | Co-training |
| Single view, well-calibrated model | Self-training |
| Single view, want diversity | Democratic co-training |
| Very few labeled examples | Co-training (better theoretical properties) |
What's Next:
Both self-training and co-training operate on individual samples. But what if we could propagate labels through the structure of the data—leveraging relationships between samples? This leads us to label propagation, which treats semi-supervised learning as inference on a graph.
You now understand co-training: the multi-view paradigm, theoretical foundations, view construction strategies, and modern variants. Next, we'll explore label propagation and graph-based methods.