Post-pruning takes a fundamentally different approach from pre-pruning. Instead of preventing overfitting during tree construction, it allows the tree to grow fully—potentially to the point of perfect training accuracy—and then systematically removes branches that don't contribute to generalization.
This approach is sometimes called grow-and-prune or backward pruning. The most principled and widely-used post-pruning method is Cost-Complexity Pruning (CCP), also known as Minimal Cost-Complexity Pruning or Weakest Link Pruning.
Why post-prune?
The key advantage of post-pruning is avoiding the horizon effect: a greedily grown, pre-pruned tree may stop at a split that looks unhelpful on its own, even when splits further down that branch would have paid off. Because post-pruning grows the full tree first, it can judge every branch by its actual contribution before deciding whether to remove it.
This page provides deep coverage of cost-complexity pruning: the mathematical formulation, the elegant alpha sequence algorithm, practical implementation with cross-validation, and how to interpret the pruning path. You'll understand why CCP is considered the gold standard for decision tree regularization.
Cost-complexity pruning is built on an elegant optimization principle: find the subtree that best balances training error against tree complexity.
The Cost-Complexity Measure:
For a tree $T$, define the cost-complexity measure:
$$R_\alpha(T) = R(T) + \alpha \cdot |T|$$
Where:
- $R(T)$ is the total training error of the tree (the sum of the errors of its leaves),
- $|T|$ is the number of terminal nodes (leaves) in $T$,
- $\alpha \geq 0$ is the complexity parameter, the price paid for each additional leaf.

Interpretation: at $\alpha = 0$ the full tree minimizes $R_\alpha$ because it has the lowest training error; as $\alpha$ grows, every leaf costs more, so progressively smaller subtrees become optimal.
The goal is to find $T_\alpha = \arg\min_T R_\alpha(T)$ for a given $\alpha$.
α controls the bias-variance tradeoff directly. Small α → low bias, high variance (complex tree memorizes training data). Large α → high bias, low variance (simple tree underfits but generalizes). The optimal α balances these extremes.
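To make this tradeoff concrete, here is a minimal numerical sketch. The error and leaf counts are invented for illustration, not taken from any dataset; the point is only to show how the preferred subtree flips as $\alpha$ grows.

```python
# A minimal numerical sketch (made-up numbers): compare two candidate subtrees
# under the cost-complexity criterion R_alpha(T) = R(T) + alpha * |T|.
def cost_complexity(training_error, n_leaves, alpha):
    return training_error + alpha * n_leaves

big_tree = {'error': 2.0, 'leaves': 20}    # low training error, many leaves
small_tree = {'error': 6.0, 'leaves': 4}   # higher training error, few leaves

for alpha in [0.0, 0.1, 0.5, 1.0]:
    r_big = cost_complexity(big_tree['error'], big_tree['leaves'], alpha)
    r_small = cost_complexity(small_tree['error'], small_tree['leaves'], alpha)
    winner = 'big' if r_big < r_small else 'small'
    print(f"alpha={alpha:.1f}: R_alpha(big)={r_big:.1f}, "
          f"R_alpha(small)={r_small:.1f} -> prefer {winner} tree")

# At alpha=0.0 and 0.1 the big tree wins; by alpha=0.5 the leaf penalty tips
# the balance (12.0 vs 8.0) and the small tree is preferred.
```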
The beauty of cost-complexity pruning lies in an efficient algorithm that computes the optimal subtree for all values of $\alpha$ simultaneously.
Key Insight: Nested Optimal Subtrees
As $\alpha$ increases from 0, the optimal subtrees form a nested sequence:
$$T_{\max} = T_0 \supseteq T_1 \supseteq T_2 \supseteq \cdots \supseteq T_K = \{\text{root}\}$$
Each tree $T_{i+1}$ is obtained from $T_i$ by pruning its weakest link (if several branches tie for the smallest effective alpha, they are pruned together).
The Effective Alpha:
For any internal node $t$ with subtree $T_t$, the effective alpha is:
$$\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|T_t| - 1}$$
Interpretation: $\alpha_{\text{eff}}(t)$ is the value of $\alpha$ at which we become indifferent between keeping the subtree $T_t$ and collapsing it into the single leaf $t$; it measures the increase in training error per leaf removed.
The weakest link is the node with the smallest $\alpha_{\text{eff}}$ — the node whose subtree provides the least error reduction per complexity unit.
```python
def compute_effective_alpha(node, tree):
    """
    Compute effective alpha for pruning a subtree.

    This is the alpha value at which we become indifferent between
    keeping the subtree and collapsing it to a leaf.
    """
    if node.is_leaf:
        return float('inf')  # Leaves cannot be pruned

    # R(t): error if we replace the subtree with a leaf
    error_as_leaf = compute_node_impurity(node) * node.n_samples

    # R(T_t): total error of the subtree
    subtree_error = sum(
        compute_node_impurity(leaf) * leaf.n_samples
        for leaf in get_leaves(node)
    )

    # |T_t|: number of leaves in the subtree
    n_leaves = count_leaves(node)

    # Effective alpha: error increase per leaf removed
    alpha_eff = (error_as_leaf - subtree_error) / (n_leaves - 1)
    return alpha_eff


def weakest_link_pruning(tree):
    """
    Generate the complete pruning sequence.

    Returns a list of (alpha, pruned_tree) pairs representing
    all optimal subtrees as alpha increases.
    """
    pruning_path = [(0.0, tree.copy())]
    current_tree = tree.copy()

    while not current_tree.is_single_node():
        # Find the node with minimum effective alpha
        min_alpha = float('inf')
        weakest_node = None
        for node in get_internal_nodes(current_tree):
            alpha = compute_effective_alpha(node, current_tree)
            if alpha < min_alpha:
                min_alpha = alpha
                weakest_node = node

        # Prune the weakest link
        prune_subtree(current_tree, weakest_node)
        pruning_path.append((min_alpha, current_tree.copy()))

    return pruning_path
```

The weakest link algorithm produces a pruning path: a sequence of subtrees and their corresponding threshold alpha values. This path is fundamental to both understanding and selecting the final pruned tree.
Properties of the Pruning Path:
- The subtrees are nested: each tree in the sequence is obtained from the previous one by pruning its weakest link.
- The threshold alpha values are non-decreasing along the path.
- The sequence is finite (the theorems below bound its length by the size of the full tree).

Visualizing the Path:
The pruning path can be visualized in multiple ways: number of leaves versus alpha, total leaf impurity versus alpha, or training and validation accuracy versus alpha (a plotting sketch follows the analysis code below).
```python
from sklearn.tree import DecisionTreeClassifier

def analyze_pruning_path(X, y):
    """
    Extract and analyze the cost-complexity pruning path.
    """
    # Fit a full tree
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X, y)

    # Get the pruning path
    path = clf.cost_complexity_pruning_path(X, y)
    ccp_alphas = path.ccp_alphas    # Alpha values
    impurities = path.impurities    # Total impurity at each alpha

    # Analyze trees along the path
    results = []
    for alpha in ccp_alphas:
        tree = DecisionTreeClassifier(
            ccp_alpha=alpha,
            random_state=42
        )
        tree.fit(X, y)
        results.append({
            'alpha': alpha,
            'n_leaves': tree.get_n_leaves(),
            'depth': tree.get_depth(),
            'train_score': tree.score(X, y)
        })

    return ccp_alphas, impurities, results

# Key insight: As alpha increases:
# - n_leaves: decreases monotonically
# - train_score: decreases monotonically
# - test_score: increases then decreases (sweet spot!)
```

The `impurities` in scikit-learn's pruning path represent the total weighted impurity of the leaves (each leaf's impurity weighted by its share of the training samples, summed over all leaves). This increases monotonically with alpha, as simpler trees have higher impurity. The impurity at alpha=0 represents the full tree's training impurity.
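To turn that analysis into the visualizations mentioned above, a sketch along the following lines can be used. It assumes matplotlib is available and reuses `analyze_pruning_path` from the previous snippet; the three panels shown are one reasonable choice, not the only one.

```python
import matplotlib.pyplot as plt

def plot_pruning_path(X, y):
    # Reuses analyze_pruning_path from above (assumes it is in scope)
    ccp_alphas, impurities, results = analyze_pruning_path(X, y)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Leaves vs alpha: complexity shrinks as alpha grows
    axes[0].plot(ccp_alphas, [r['n_leaves'] for r in results],
                 marker='o', drawstyle='steps-post')
    axes[0].set_xlabel('ccp_alpha')
    axes[0].set_ylabel('number of leaves')

    # Total leaf impurity vs alpha: rises monotonically
    axes[1].plot(ccp_alphas, impurities, marker='o', drawstyle='steps-post')
    axes[1].set_xlabel('ccp_alpha')
    axes[1].set_ylabel('total leaf impurity')

    # Training accuracy vs alpha: falls as the tree is simplified
    axes[2].plot(ccp_alphas, [r['train_score'] for r in results],
                 marker='o', drawstyle='steps-post')
    axes[2].set_xlabel('ccp_alpha')
    axes[2].set_ylabel('training accuracy')

    plt.tight_layout()
    plt.show()
```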
The pruning path gives us all optimal subtrees, but we still need to select the best $\alpha$ for our specific problem. Cross-validation is the standard approach.
The Cross-Validation Procedure:
1. Compute the pruning path on the training data to obtain the candidate alpha values.
2. For each candidate alpha, fit a pruned tree and estimate its generalization performance with k-fold cross-validation.
3. Select an alpha using one of the rules below, then retrain the pruned tree on the full training set.

Common Selection Rules:
- Max rule: choose the alpha with the best mean cross-validation score.
- 1-SE rule: choose the largest alpha (the simplest tree) whose mean score is within one standard error of the best.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

def select_alpha_cv(X, y, cv=5):
    """
    Select optimal ccp_alpha using cross-validation.

    Implements both the max rule and the 1-SE rule.
    """
    # Get pruning path from the full tree
    clf = DecisionTreeClassifier(random_state=42)
    path = clf.cost_complexity_pruning_path(X, y)
    ccp_alphas = path.ccp_alphas

    # Skip the largest alpha, which produces a trivial (root-only) tree
    ccp_alphas = ccp_alphas[:-1]

    # Cross-validate each alpha
    cv_scores = []
    for alpha in ccp_alphas:
        tree = DecisionTreeClassifier(
            ccp_alpha=alpha,
            random_state=42
        )
        scores = cross_val_score(tree, X, y, cv=cv)
        cv_scores.append({
            'alpha': alpha,
            'mean': scores.mean(),
            'std': scores.std(),
            'se': scores.std() / np.sqrt(cv)
        })

    # Max rule: best mean CV score
    best_idx_max = np.argmax([s['mean'] for s in cv_scores])
    alpha_max = cv_scores[best_idx_max]['alpha']

    # 1-SE rule: largest alpha within 1 SE of the max
    max_score = cv_scores[best_idx_max]['mean']
    max_se = cv_scores[best_idx_max]['se']
    threshold = max_score - max_se

    alpha_1se = alpha_max
    for result in cv_scores:  # alphas are in increasing order
        if result['mean'] >= threshold:
            alpha_1se = result['alpha']

    return {
        'alpha_max_rule': alpha_max,
        'alpha_1se_rule': alpha_1se,
        'cv_results': cv_scores
    }
```

The 1-standard-error rule, popularized by Breiman, selects the simplest model whose performance is within one standard error of the best. The rationale: if we can't statistically distinguish between models, prefer the simpler one. This often produces more interpretable and robust trees.
Cost-complexity pruning has elegant mathematical properties that guarantee its effectiveness and efficiency.
Theorem 1: Unique Minimum
For any $\alpha > 0$, there exists a unique smallest minimizing subtree $T_\alpha \subseteq T_{\max}$ such that: $$R_\alpha(T_\alpha) = \min_{T \subseteq T_{\max}} R_\alpha(T)$$
Theorem 2: Nesting Property
If $\alpha_1 < \alpha_2$, then $T_{\alpha_2} \subseteq T_{\alpha_1}$.
This means optimal subtrees are nested—a critical property enabling the efficient algorithm.
Theorem 3: Finite Sequence
The sequence of optimal subtrees is finite, with at most $|T_{\max}|$ distinct trees.
Theorem 4: Polynomial Complexity
The complete pruning path can be computed in $O(|T_{\max}|^2)$ time in the worst case, or $O(|T_{\max}| \log |T_{\max}|)$ with efficient data structures.
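These properties are easy to spot-check empirically. The following sketch (using the breast-cancer dataset from scikit-learn purely as a convenient example) refits one tree per alpha on the path and verifies that leaf counts never increase as alpha grows, which is exactly what the nesting property predicts.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

# Spot-check the nesting property: as alpha increases along the path,
# the optimal subtrees can only shrink, so leaf counts are non-increasing.
X, y = load_breast_cancer(return_X_y=True)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)

leaf_counts = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())

assert all(a >= b for a, b in zip(leaf_counts, leaf_counts[1:])), \
    "leaf counts should be non-increasing along the pruning path"
print(leaf_counts)  # starts at the full tree's leaf count and ends at 1
```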
Let's examine a complete practical workflow for applying cost-complexity pruning with scikit-learn.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
import numpy as np

# Complete CCP workflow
def ccp_workflow():
    # Load data
    data = load_breast_cancer()
    X, y = data.data, data.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Step 1: Get the pruning path
    clf = DecisionTreeClassifier(random_state=42)
    path = clf.cost_complexity_pruning_path(X_train, y_train)
    ccp_alphas = path.ccp_alphas[:-1]  # Exclude the trivial root-only tree

    # Step 2: Cross-validate each alpha
    cv_means, cv_stds = [], []
    for alpha in ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
        scores = cross_val_score(tree, X_train, y_train, cv=5)
        cv_means.append(scores.mean())
        cv_stds.append(scores.std())

    # Step 3: Select best alpha (1-SE rule)
    best_idx = np.argmax(cv_means)
    best_mean, best_std = cv_means[best_idx], cv_stds[best_idx]
    threshold = best_mean - best_std / np.sqrt(5)
    for mean, alpha in zip(cv_means, ccp_alphas):
        # Alphas are increasing, so this keeps the largest qualifying alpha
        if mean >= threshold:
            best_alpha = alpha

    # Step 4: Train the final model
    final_tree = DecisionTreeClassifier(
        ccp_alpha=best_alpha, random_state=42
    )
    final_tree.fit(X_train, y_train)

    # Step 5: Evaluate
    print(f"Selected alpha: {best_alpha:.6f}")
    print(f"Tree depth: {final_tree.get_depth()}")
    print(f"Number of leaves: {final_tree.get_n_leaves()}")
    print(f"Test accuracy: {final_tree.score(X_test, y_test):.4f}")

    return final_tree

ccp_workflow()
```

Having studied both strategies in depth, let's consolidate the comparison to guide practical decisions.
| Criterion | Pre-pruning | Post-pruning (CCP) |
|---|---|---|
| Theoretical optimality | Local (greedy) | Global for given α |
| Computational cost | Lower | Higher (full tree + path) |
| Horizon effect risk | Yes | No |
| Hyperparameters | Multiple (depth, samples, etc.) | Single (α) |
| Interpretability of tuning | Direct constraints | Abstract penalty |
| Typical use case | Quick experiments, ensembles | Careful single-tree modeling |
| Combined with CV | Yes, tune each parameter | Yes, tune α |
In practice, combining both approaches often works best: use light pre-pruning (reasonable max_depth, min_samples_leaf) to prevent extreme trees, then apply CCP with CV to fine-tune. This reduces the computational cost of CCP while maintaining its global optimization benefits.
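A minimal sketch of that combined recipe is shown below. The specific constraint values and the use of `GridSearchCV` are illustrative choices, not prescriptions from any particular reference.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Light pre-pruning keeps the initial tree (and hence the pruning path) small;
# cross-validated CCP then fine-tunes alpha on top of those constraints.
# The parameter values below are illustrative placeholders.
base = DecisionTreeClassifier(
    max_depth=12,        # loose cap: only prevents pathological depth
    min_samples_leaf=5,  # loose floor: avoids single-sample leaves
    random_state=42,
)

def tune_combined(X, y, cv=5):
    # Candidate alphas come from the path of the lightly constrained tree
    alphas = base.cost_complexity_pruning_path(X, y).ccp_alphas[:-1]
    search = GridSearchCV(base, {'ccp_alpha': list(alphas)}, cv=cv)
    search.fit(X, y)
    return search.best_estimator_
```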
You now understand cost-complexity pruning, the principled approach to decision tree regularization. You can implement the complete CCP workflow and select optimal alpha values using cross-validation. Next, we'll explore specific pre-pruning parameters in detail: minimum samples constraints.