Imagine two experts examining the same patient—a radiologist reading X-rays and a pathologist analyzing blood tests. Each sees different aspects of the same underlying condition. When their independent assessments agree, confidence is high. When they disagree, it signals uncertainty worth investigating.
This intuition drives co-training, a powerful semi-supervised learning paradigm introduced by Blum and Mitchell (1998). Instead of a single classifier teaching itself (as in self-training), co-training uses two classifiers on different feature views that teach each other. This multi-view approach naturally breaks the confirmation bias that plagues single-model methods.
By the end of this page, you will understand the co-training algorithm, the critical multi-view assumption, theoretical conditions for success, view construction strategies when natural views don't exist, and modern extensions like multi-view and democratic co-training.
Co-training assumes features can be split into two conditionally independent views $X = (X^{(1)}, X^{(2)})$, each sufficient to predict the label. The algorithm trains separate classifiers on each view and has them teach each other.
Formal Setup: Given a small labeled set $D_L = \{(x_i^{(1)}, x_i^{(2)}, y_i)\}$ and a large unlabeled set $D_U = \{(x_j^{(1)}, x_j^{(2)})\}$, train a classifier $f_1$ on view 1 and $f_2$ on view 2, then iterate: each classifier pseudo-labels its most confident unlabeled examples and adds them to the labeled pool, so each classifier is retrained on examples labeled by the other.
```python
import numpy as np


def co_training(D_L, D_U, classifier1, classifier2,
                max_iterations=50, p=1, n=3, pool_size=75):
    """
    Classic Co-Training Algorithm (Blum & Mitchell, 1998).

    Args:
        D_L: Labeled data [(x1, x2, y), ...]
        D_U: Unlabeled data [(x1, x2), ...]
        classifier1, classifier2: Classifiers for each view
        p: Positive samples to add per classifier per iteration
        n: Negative samples to add per classifier per iteration
        pool_size: Size of the unlabeled pool sampled each iteration
    """
    X1_L = [d[0] for d in D_L]  # View 1 features
    X2_L = [d[1] for d in D_L]  # View 2 features
    y_L = [d[2] for d in D_L]
    U = list(D_U)

    f1, f2 = None, None
    for iteration in range(max_iterations):
        # Train both classifiers on the current labeled set
        f1 = classifier1.fit(X1_L, y_L)
        f2 = classifier2.fit(X2_L, y_L)

        if len(U) == 0:
            break

        # Sample a pool from the unlabeled data
        pool_indices = np.random.choice(len(U), min(pool_size, len(U)), replace=False)
        pool = [U[i] for i in pool_indices]

        # Classifier 1 labels samples for classifier 2, and vice versa
        probs1 = f1.predict_proba([x[0] for x in pool])
        probs2 = f2.predict_proba([x[1] for x in pool])

        # Select the most confident predictions from each classifier
        added = []
        for clf_probs in (probs1, probs2):
            for class_label in [0, 1]:  # Binary classification
                n_add = p if class_label == 1 else n
                class_probs = clf_probs[:, class_label]
                top_indices = np.argsort(class_probs)[-n_add:]
                for idx in top_indices:
                    if clf_probs[idx, class_label] > 0.5:  # Basic confidence threshold
                        added.append((pool_indices[idx], class_label))

        # Add pseudo-labeled samples to the labeled set
        for idx, label in added:
            X1_L.append(U[idx][0])
            X2_L.append(U[idx][1])
            y_L.append(label)

        # Remove the pseudo-labeled samples from the unlabeled set
        for idx in sorted(set(i for i, _ in added), reverse=True):
            U.pop(idx)

        print(f"Iter {iteration}: Added {len(added)}, Labeled: {len(y_L)}")

    return f1, f2
```
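A minimal usage sketch of the `co_training` function above, on synthetic two-view data with logistic regression as both base classifiers. The data generation and all hyperparameters here are illustrative, not part of the original algorithm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: two 5-dimensional views whose means shift with the label
rng = np.random.default_rng(0)
y_all = rng.integers(0, 2, size=500)
view1 = rng.normal(loc=y_all[:, None] * 1.5, scale=1.0, size=(500, 5))
view2 = rng.normal(loc=y_all[:, None] * 1.5, scale=1.0, size=(500, 5))

# Keep only 20 labeled examples; treat the rest as unlabeled
D_L = [(view1[i], view2[i], y_all[i]) for i in range(20)]
D_U = [(view1[i], view2[i]) for i in range(20, 500)]

f1, f2 = co_training(
    D_L, D_U,
    classifier1=LogisticRegression(max_iter=1000),
    classifier2=LogisticRegression(max_iter=1000),
    max_iterations=30,
)

# One simple way to combine the two views at prediction time: average their probabilities
p = (f1.predict_proba(view1)[:, 1] + f2.predict_proba(view2)[:, 1]) / 2
print("Accuracy on all 500 samples:", np.mean((p > 0.5) == y_all))
```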
Key Insight: Mutual Confirmation
The power of co-training comes from cross-view validation. When classifier 1 confidently predicts a label, that sample is added to classifier 2's training set. Since the views are independent, classifier 2's features provide an independent check—if the label were wrong, classifier 2 would likely struggle to fit it, limiting error propagation.
Co-training's theoretical guarantees rest on two critical assumptions:
1. Sufficiency: Each view is independently sufficient to determine the label: $$P(Y | X^{(1)}) = P(Y | X) \quad \text{and} \quad P(Y | X^{(2)}) = P(Y | X)$$
2. Conditional Independence: Given the label, the views are independent: $$P(X^{(1)}, X^{(2)} | Y) = P(X^{(1)} | Y) \cdot P(X^{(2)} | Y)$$
These assumptions are strong but enable powerful theoretical results.
| Domain | View 1 | View 2 |
|---|---|---|
| Web Page Classification | Page text content | Anchor text of incoming links |
| Video Classification | Visual frames | Audio/speech track |
| Named Entity Recognition | Word features | Context window features |
| Medical Diagnosis | Imaging (X-ray, MRI) | Lab test results |
| Document Classification | Title + abstract | Full body text |
In practice, perfect conditional independence rarely holds. Views are often correlated—page text and anchor text both mention product names. Relaxed versions of co-training can still work with 'approximately independent' views, but performance degrades as correlation increases.
Blum and Mitchell (1998) proved that under the multi-view assumptions, co-training can learn with very few labeled examples.
PAC-Learning Guarantee:
If views are sufficient and conditionally independent, and the initial classifier on each view has error $\epsilon < 0.5$, then co-training achieves arbitrarily low error using only $O(\log(1/\epsilon))$ labeled examples.
Contraction Lemma:
Define the compatibility of two classifiers as the probability that they agree on a random unlabeled example. Dasgupta et al. (2002) showed that co-training works by making the classifiers progressively more compatible, i.e., by reducing their disagreement on unlabeled data:
$$\text{disagreement}(f_1^{(t+1)}, f_2^{(t+1)}) \leq \text{disagreement}(f_1^{(t)}, f_2^{(t)})$$
As disagreement decreases, both classifiers converge toward the true decision boundary.
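A small sketch of how this can be monitored in practice. Assuming `f1` and `f2` are the current view classifiers and `U` holds the remaining unlabeled `(x1, x2)` pairs, as in the algorithm above, the disagreement rate should trend downward across co-training iterations:

```python
import numpy as np

def disagreement_rate(f1, f2, U):
    """Fraction of unlabeled samples on which the two view-specific classifiers disagree."""
    if len(U) == 0:
        return 0.0
    pred1 = f1.predict([x[0] for x in U])  # Predictions from view 1
    pred2 = f2.predict([x[1] for x in U])  # Predictions from view 2
    return float(np.mean(pred1 != pred2))
```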
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def analyze_view_quality(X1, X2, y, test_size=0.3):
    """
    Analyze whether two views satisfy the co-training assumptions.
    Returns diagnostics for sufficiency and (approximate) independence.
    Assumes binary labels.
    """
    X1_tr, X1_te, X2_tr, X2_te, y_tr, y_te = train_test_split(
        X1, X2, y, test_size=test_size, stratify=y
    )

    # Sufficiency: each view on its own should predict the label well
    clf1 = LogisticRegression().fit(X1_tr, y_tr)
    clf2 = LogisticRegression().fit(X2_tr, y_tr)
    acc1 = accuracy_score(y_te, clf1.predict(X1_te))
    acc2 = accuracy_score(y_te, clf2.predict(X2_te))

    # Independence: the views' residual errors should be uncorrelated
    res1 = clf1.predict_proba(X1_te)[:, 1] - y_te
    res2 = clf2.predict_proba(X2_te)[:, 1] - y_te
    residual_corr = np.corrcoef(res1, res2)[0, 1]

    # Prediction agreement (lower = more diverse views)
    pred1, pred2 = clf1.predict(X1_te), clf2.predict(X2_te)
    agreement = np.mean(pred1 == pred2)

    return {
        'view1_accuracy': acc1,
        'view2_accuracy': acc2,
        'residual_correlation': residual_corr,
        'prediction_agreement': agreement,
        'suitable_for_cotraining': acc1 > 0.6 and acc2 > 0.6 and abs(residual_corr) < 0.3
    }
```

Natural multi-view data is rare. When views don't exist naturally, we can construct them artificially:
1. Random Feature Split: Randomly partition features into two disjoint sets. Simple but doesn't guarantee independence.
2. PCA-Based Split: Use first $k$ principal components as view 1, remaining as view 2. Captures orthogonal variance.
3. Learned Split (Wang & Zhou, 2010): Optimize the feature split to maximize view disagreement subject to each view being predictive.
```python
import numpy as np
from sklearn.decomposition import PCA


class ViewConstructor:
    """Strategies for constructing views when natural views don't exist."""

    @staticmethod
    def random_split(X, split_ratio=0.5, seed=42):
        """Randomly partition features into two disjoint views."""
        np.random.seed(seed)
        n_features = X.shape[1]
        n_view1 = int(n_features * split_ratio)
        indices = np.random.permutation(n_features)
        view1_idx = indices[:n_view1]
        view2_idx = indices[n_view1:]
        return X[:, view1_idx], X[:, view2_idx]

    @staticmethod
    def pca_split(X, n_components_v1=None):
        """Split using PCA: leading components vs. the reconstruction residual."""
        if n_components_v1 is None:
            n_components_v1 = X.shape[1] // 2
        pca = PCA(n_components=n_components_v1)
        view1 = pca.fit_transform(X)
        # Residual after projecting onto the leading components
        reconstructed = pca.inverse_transform(view1)
        view2 = X - reconstructed
        return view1, view2

    @staticmethod
    def domain_knowledge_split(X, feature_groups):
        """Split based on semantic feature groups.

        Args:
            feature_groups: Dict mapping group names to feature indices,
                e.g., {'text': [0, 1, 2], 'metadata': [3, 4, 5]}
        """
        groups = list(feature_groups.values())
        view1 = X[:, groups[0]]
        view2 = X[:, np.concatenate(groups[1:])]
        return view1, view2
```

Domain knowledge splits outperform random splits. If your data has semantic groups (e.g., text features vs. numerical, image vs. metadata), use those natural boundaries. The goal is views that make independent errors; random splits often share correlated errors.
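A short sketch tying the two helpers above together: build candidate views from a single feature matrix, then sanity-check them with `analyze_view_quality` before running co-training. The dataset here is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification

# Illustrative single-view dataset with 20 features
X, y = make_classification(n_samples=400, n_features=20, n_informative=10, random_state=0)

candidates = {
    "random": ViewConstructor.random_split(X),
    "pca": ViewConstructor.pca_split(X),
}
for name, (v1, v2) in candidates.items():
    report = analyze_view_quality(v1, v2, y)
    print(name, report)
# Prefer the split where both views are individually predictive and
# the residual correlation between their errors is low.
```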
Research has extended co-training in several directions:
Multi-View Learning: Extend to $k > 2$ views. Each classifier's predictions are validated by majority vote of others: $$\tilde{y}_j = \text{majority}(f_1(x_j^{(1)}), ..., f_k(x_j^{(k)}))$$
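A minimal sketch of that voting step, assuming non-negative integer class labels and a list of $k$ classifiers, each already trained on its own view:

```python
import numpy as np

def majority_vote_labels(classifiers, views):
    """Pseudo-label a batch by majority vote over k view-specific classifiers.
    classifiers[i] was trained on views[i], an array of shape (n_samples, d_i)."""
    preds = np.stack([clf.predict(v) for clf, v in zip(classifiers, views)])  # (k, n_samples)
    # Per-sample majority; requires non-negative integer labels, ties go to the smaller label
    return np.array([np.bincount(col).argmax() for col in preds.T])
```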
Democratic Co-Training (Zhou & Li, 2005): Use diverse classifiers on the same features rather than different views. Diversity comes from algorithm differences (e.g., SVM, Random Forest, Neural Net).
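A rough sketch of the core idea only, not the full published procedure (which also weights learners by estimated confidence): train diverse models on the same features and pseudo-label only the points on which they agree. All names here are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def democratic_round(X_L, y_L, X_U):
    """One simplified round: pseudo-label the unlabeled points on which all diverse learners agree.
    X_L, X_U: arrays of shape (n, d); y_L: integer labels."""
    learners = [
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=100, random_state=0),
        SVC(),  # Hard predictions are enough for an agreement check
    ]
    preds = np.stack([m.fit(X_L, y_L).predict(X_U) for m in learners])  # (3, n_unlabeled)
    consensus = (preds[0] == preds[1]) & (preds[1] == preds[2])
    # Candidate pseudo-labeled examples to merge into the next round's training set
    return X_U[consensus], preds[0][consensus]
```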
Co-Regularization: Instead of pseudo-labeling, add a regularization term encouraging classifiers to agree: $$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 + \lambda \sum_{x \in U} |f_1(x) - f_2(x)|^2$$
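A sketch of this objective for two binary probabilistic classifiers, using cross-entropy for the supervised terms; the function and variable names are illustrative:

```python
import numpy as np

def co_regularization_loss(p1_l, p2_l, y, p1_u, p2_u, lam=1.0):
    """L = L_1 + L_2 + lambda * sum over unlabeled x of (f_1(x) - f_2(x))^2.
    p1_l, p2_l: each view's predicted P(y=1|x) on labeled data; p1_u, p2_u: same on unlabeled data."""
    eps = 1e-12
    ce1 = -np.mean(y * np.log(p1_l + eps) + (1 - y) * np.log(1 - p1_l + eps))
    ce2 = -np.mean(y * np.log(p2_l + eps) + (1 - y) * np.log(1 - p2_l + eps))
    agreement_penalty = np.sum((p1_u - p2_u) ** 2)  # Encourages the two views to agree on U
    return ce1 + ce2 + lam * agreement_penalty
```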
| Variant | Diversity Source | Best When |
|---|---|---|
| Classic Co-Training | Feature splits | Natural views exist |
| Democratic Co-Training | Algorithm diversity | Single feature set, multiple models |
| Multi-View | Multiple views (>2) | Rich multi-modal data |
| Co-Regularization | Soft agreement constraint | Want smooth optimization |
When to Choose Each:
| Scenario | Recommendation |
|---|---|
| Natural multi-view data (web, multimodal) | Co-training |
| Single view, well-calibrated model | Self-training |
| Single view, want diversity | Democratic co-training |
| Very few labeled examples | Co-training (better theoretical properties) |
What's Next:
Both self-training and co-training operate on individual samples. But what if we could propagate labels through the structure of the data—leveraging relationships between samples? This leads us to label propagation, which treats semi-supervised learning as inference on a graph.
You now understand co-training: the multi-view paradigm, theoretical foundations, view construction strategies, and modern variants. Next, we'll explore label propagation and graph-based methods.