Post-pruning takes a fundamentally different approach from pre-pruning. Instead of preventing overfitting during tree construction, it allows the tree to grow fully—potentially to the point of perfect training accuracy—and then systematically removes branches that don't contribute to generalization.
This approach is sometimes called grow-and-prune or backward pruning. The most principled and widely-used post-pruning method is Cost-Complexity Pruning (CCP), also known as Minimal Cost-Complexity Pruning or Weakest Link Pruning.
Why post-prune?
The key advantage of post-pruning is avoiding the horizon effect: a greedily grown, pre-pruned tree may stop at a split that looks unhelpful on its own, even when splits further down that branch would have paid off. Because post-pruning grows the full tree first, it can judge every branch by its actual contribution before deciding whether to remove it.
This page provides deep coverage of cost-complexity pruning: the mathematical formulation, the elegant alpha sequence algorithm, practical implementation with cross-validation, and how to interpret the pruning path. You'll understand why CCP is considered the gold standard for decision tree regularization.
Cost-complexity pruning is built on an elegant optimization principle: find the subtree that best balances training error against tree complexity.
The Cost-Complexity Measure:
For a tree $T$, define the cost-complexity measure:
$$R_\alpha(T) = R(T) + \alpha \cdot |T|$$
Where:
- $R(T)$ is the total training error of the tree (the sum of the errors of its leaves),
- $|T|$ is the number of terminal nodes (leaves) in $T$,
- $\alpha \geq 0$ is the complexity parameter, the price paid for each additional leaf.

Interpretation: at $\alpha = 0$ the full tree minimizes $R_\alpha$ because it has the lowest training error; as $\alpha$ grows, every leaf costs more, so progressively smaller subtrees become optimal.
The goal is to find $T_\alpha = \arg\min_T R_\alpha(T)$ for a given $\alpha$.
α controls the bias-variance tradeoff directly. Small α → low bias, high variance (complex tree memorizes training data). Large α → high bias, low variance (simple tree underfits but generalizes). The optimal α balances these extremes.
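To make this tradeoff concrete, here is a minimal numerical sketch. The error and leaf counts are invented for illustration, not taken from any dataset; the point is only to show how the preferred subtree flips as $\alpha$ grows.

```python
# A minimal numerical sketch (made-up numbers): compare two candidate subtrees
# under the cost-complexity criterion R_alpha(T) = R(T) + alpha * |T|.
def cost_complexity(training_error, n_leaves, alpha):
    return training_error + alpha * n_leaves

big_tree = {'error': 2.0, 'leaves': 20}    # low training error, many leaves
small_tree = {'error': 6.0, 'leaves': 4}   # higher training error, few leaves

for alpha in [0.0, 0.1, 0.5, 1.0]:
    r_big = cost_complexity(big_tree['error'], big_tree['leaves'], alpha)
    r_small = cost_complexity(small_tree['error'], small_tree['leaves'], alpha)
    winner = 'big' if r_big < r_small else 'small'
    print(f"alpha={alpha:.1f}: R_alpha(big)={r_big:.1f}, "
          f"R_alpha(small)={r_small:.1f} -> prefer {winner} tree")

# At alpha=0.0 and 0.1 the big tree wins; by alpha=0.5 the leaf penalty tips
# the balance (12.0 vs 8.0) and the small tree is preferred.
```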
The beauty of cost-complexity pruning lies in an efficient algorithm that computes the optimal subtree for all values of $\alpha$ simultaneously.
Key Insight: Nested Optimal Subtrees
As $\alpha$ increases from 0, the optimal subtrees form a nested sequence:
$$T_{\max} = T_0 \supseteq T_1 \supseteq T_2 \supseteq \cdots \supseteq T_K = \{\text{root}\}$$
Each tree $T_{i+1}$ is obtained from $T_i$ by pruning its weakest link (if several branches tie for the smallest effective alpha, they are pruned together).
The Effective Alpha:
For any internal node $t$ with subtree $T_t$, the effective alpha is:
$$\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|T_t| - 1}$$
Interpretation: $\alpha_{\text{eff}}(t)$ is the value of $\alpha$ at which we become indifferent between keeping the subtree $T_t$ and collapsing it into the single leaf $t$; it measures the increase in training error per leaf removed.
The weakest link is the node with the smallest $\alpha_{\text{eff}}$ — the node whose subtree provides the least error reduction per complexity unit.
```python
def compute_effective_alpha(node, tree):
    """
    Compute effective alpha for pruning a subtree.

    This is the alpha value at which we become indifferent between
    keeping the subtree and collapsing it to a leaf.
    """
    if node.is_leaf:
        return float('inf')  # Leaves cannot be pruned

    # R(t): error if we replace the subtree with a leaf
    error_as_leaf = compute_node_impurity(node) * node.n_samples

    # R(T_t): total error of the subtree
    subtree_error = sum(
        compute_node_impurity(leaf) * leaf.n_samples
        for leaf in get_leaves(node)
    )

    # |T_t|: number of leaves in the subtree
    n_leaves = count_leaves(node)

    # Effective alpha: error increase per leaf removed
    alpha_eff = (error_as_leaf - subtree_error) / (n_leaves - 1)
    return alpha_eff


def weakest_link_pruning(tree):
    """
    Generate the complete pruning sequence.

    Returns a list of (alpha, pruned_tree) pairs representing
    all optimal subtrees as alpha increases.
    """
    pruning_path = [(0.0, tree.copy())]
    current_tree = tree.copy()

    while not current_tree.is_single_node():
        # Find the node with minimum effective alpha
        min_alpha = float('inf')
        weakest_node = None
        for node in get_internal_nodes(current_tree):
            alpha = compute_effective_alpha(node, current_tree)
            if alpha < min_alpha:
                min_alpha = alpha
                weakest_node = node

        # Prune the weakest link
        prune_subtree(current_tree, weakest_node)
        pruning_path.append((min_alpha, current_tree.copy()))

    return pruning_path
```

The weakest link algorithm produces a pruning path: a sequence of subtrees and their corresponding threshold alpha values. This path is fundamental to both understanding and selecting the final pruned tree.
Properties of the Pruning Path:
- The subtrees are nested: each tree in the sequence is obtained from the previous one by pruning its weakest link.
- The threshold alpha values are non-decreasing along the path.
- The sequence is finite (the theorems below bound its length by the size of the full tree).

Visualizing the Path:
The pruning path can be visualized in multiple ways: number of leaves versus alpha, total leaf impurity versus alpha, or training and validation accuracy versus alpha (a plotting sketch follows the analysis code below).
```python
from sklearn.tree import DecisionTreeClassifier

def analyze_pruning_path(X, y):
    """
    Extract and analyze the cost-complexity pruning path.
    """
    # Fit a full tree
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X, y)

    # Get the pruning path
    path = clf.cost_complexity_pruning_path(X, y)
    ccp_alphas = path.ccp_alphas    # Alpha values
    impurities = path.impurities    # Total impurity at each alpha

    # Analyze trees along the path
    results = []
    for alpha in ccp_alphas:
        tree = DecisionTreeClassifier(
            ccp_alpha=alpha,
            random_state=42
        )
        tree.fit(X, y)
        results.append({
            'alpha': alpha,
            'n_leaves': tree.get_n_leaves(),
            'depth': tree.get_depth(),
            'train_score': tree.score(X, y)
        })

    return ccp_alphas, impurities, results

# Key insight: As alpha increases:
# - n_leaves: decreases monotonically
# - train_score: decreases monotonically
# - test_score: increases then decreases (sweet spot!)
```

The `impurities` in scikit-learn's pruning path represent the total weighted impurity of the leaves (each leaf's impurity weighted by its share of the training samples, summed over all leaves). This increases monotonically with alpha, as simpler trees have higher impurity. The impurity at alpha=0 represents the full tree's training impurity.
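To turn that analysis into the visualizations mentioned above, a sketch along the following lines can be used. It assumes matplotlib is available and reuses `analyze_pruning_path` from the previous snippet; the three panels shown are one reasonable choice, not the only one.

```python
import matplotlib.pyplot as plt

def plot_pruning_path(X, y):
    # Reuses analyze_pruning_path from above (assumes it is in scope)
    ccp_alphas, impurities, results = analyze_pruning_path(X, y)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Leaves vs alpha: complexity shrinks as alpha grows
    axes[0].plot(ccp_alphas, [r['n_leaves'] for r in results],
                 marker='o', drawstyle='steps-post')
    axes[0].set_xlabel('ccp_alpha')
    axes[0].set_ylabel('number of leaves')

    # Total leaf impurity vs alpha: rises monotonically
    axes[1].plot(ccp_alphas, impurities, marker='o', drawstyle='steps-post')
    axes[1].set_xlabel('ccp_alpha')
    axes[1].set_ylabel('total leaf impurity')

    # Training accuracy vs alpha: falls as the tree is simplified
    axes[2].plot(ccp_alphas, [r['train_score'] for r in results],
                 marker='o', drawstyle='steps-post')
    axes[2].set_xlabel('ccp_alpha')
    axes[2].set_ylabel('training accuracy')

    plt.tight_layout()
    plt.show()
```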
The pruning path gives us all optimal subtrees, but we still need to select the best $\alpha$ for our specific problem. Cross-validation is the standard approach.
The Cross-Validation Procedure:
1. Compute the pruning path on the training data to obtain the candidate alpha values.
2. For each candidate alpha, fit a pruned tree and estimate its generalization performance with k-fold cross-validation.
3. Select an alpha using one of the rules below, then retrain the pruned tree on the full training set.

Common Selection Rules:
- Max rule: choose the alpha with the best mean cross-validation score.
- 1-SE rule: choose the largest alpha (the simplest tree) whose mean score is within one standard error of the best.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

def select_alpha_cv(X, y, cv=5):
    """
    Select optimal ccp_alpha using cross-validation.

    Implements both the max rule and the 1-SE rule.
    """
    # Get pruning path from the full tree
    clf = DecisionTreeClassifier(random_state=42)
    path = clf.cost_complexity_pruning_path(X, y)
    ccp_alphas = path.ccp_alphas

    # Skip the largest alpha, which produces a trivial (root-only) tree
    ccp_alphas = ccp_alphas[:-1]

    # Cross-validate each alpha
    cv_scores = []
    for alpha in ccp_alphas:
        tree = DecisionTreeClassifier(
            ccp_alpha=alpha,
            random_state=42
        )
        scores = cross_val_score(tree, X, y, cv=cv)
        cv_scores.append({
            'alpha': alpha,
            'mean': scores.mean(),
            'std': scores.std(),
            'se': scores.std() / np.sqrt(cv)
        })

    # Max rule: best mean CV score
    best_idx_max = np.argmax([s['mean'] for s in cv_scores])
    alpha_max = cv_scores[best_idx_max]['alpha']

    # 1-SE rule: largest alpha within 1 SE of the max
    max_score = cv_scores[best_idx_max]['mean']
    max_se = cv_scores[best_idx_max]['se']
    threshold = max_score - max_se

    alpha_1se = alpha_max
    for result in cv_scores:  # alphas are in increasing order
        if result['mean'] >= threshold:
            alpha_1se = result['alpha']

    return {
        'alpha_max_rule': alpha_max,
        'alpha_1se_rule': alpha_1se,
        'cv_results': cv_scores
    }
```

The 1-standard-error rule, popularized by Breiman, selects the simplest model whose performance is within one standard error of the best. The rationale: if we can't statistically distinguish between models, prefer the simpler one. This often produces more interpretable and robust trees.
Cost-complexity pruning has elegant mathematical properties that guarantee its effectiveness and efficiency.
Theorem 1: Unique Minimum
For any $\alpha > 0$, there exists a unique smallest minimizing subtree $T_\alpha \subseteq T_{\max}$ such that: $$R_\alpha(T_\alpha) = \min_{T \subseteq T_{\max}} R_\alpha(T)$$
Theorem 2: Nesting Property
If $\alpha_1 < \alpha_2$, then $T_{\alpha_2} \subseteq T_{\alpha_1}$.
This means optimal subtrees are nested—a critical property enabling the efficient algorithm.
Theorem 3: Finite Sequence
The sequence of optimal subtrees is finite, with at most $|T_{\max}|$ distinct trees.
Theorem 4: Polynomial Complexity
The complete pruning path can be computed in $O(|T_{\max}|^2)$ time in the worst case, or $O(|T_{\max}| \log |T_{\max}|)$ with efficient data structures.
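These properties are easy to spot-check empirically. The following sketch (using the breast-cancer dataset from scikit-learn purely as a convenient example) refits one tree per alpha on the path and verifies that leaf counts never increase as alpha grows, which is exactly what the nesting property predicts.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

# Spot-check the nesting property: as alpha increases along the path,
# the optimal subtrees can only shrink, so leaf counts are non-increasing.
X, y = load_breast_cancer(return_X_y=True)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)

leaf_counts = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())

assert all(a >= b for a, b in zip(leaf_counts, leaf_counts[1:])), \
    "leaf counts should be non-increasing along the pruning path"
print(leaf_counts)  # starts at the full tree's leaf count and ends at 1
```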
Let's examine a complete practical workflow for applying cost-complexity pruning with scikit-learn.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
import numpy as np

# Complete CCP workflow
def ccp_workflow():
    # Load data
    data = load_breast_cancer()
    X, y = data.data, data.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Step 1: Get the pruning path
    clf = DecisionTreeClassifier(random_state=42)
    path = clf.cost_complexity_pruning_path(X_train, y_train)
    ccp_alphas = path.ccp_alphas[:-1]  # Exclude the trivial root-only tree

    # Step 2: Cross-validate each alpha
    cv_means, cv_stds = [], []
    for alpha in ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
        scores = cross_val_score(tree, X_train, y_train, cv=5)
        cv_means.append(scores.mean())
        cv_stds.append(scores.std())

    # Step 3: Select best alpha (1-SE rule)
    best_idx = np.argmax(cv_means)
    best_mean, best_std = cv_means[best_idx], cv_stds[best_idx]
    threshold = best_mean - best_std / np.sqrt(5)
    for mean, alpha in zip(cv_means, ccp_alphas):
        # Alphas are increasing, so this keeps the largest qualifying alpha
        if mean >= threshold:
            best_alpha = alpha

    # Step 4: Train the final model
    final_tree = DecisionTreeClassifier(
        ccp_alpha=best_alpha, random_state=42
    )
    final_tree.fit(X_train, y_train)

    # Step 5: Evaluate
    print(f"Selected alpha: {best_alpha:.6f}")
    print(f"Tree depth: {final_tree.get_depth()}")
    print(f"Number of leaves: {final_tree.get_n_leaves()}")
    print(f"Test accuracy: {final_tree.score(X_test, y_test):.4f}")

    return final_tree

ccp_workflow()
```

Having studied both strategies in depth, let's consolidate the comparison to guide practical decisions.
| Criterion | Pre-pruning | Post-pruning (CCP) |
|---|---|---|
| Theoretical optimality | Local (greedy) | Global for given α |
| Computational cost | Lower | Higher (full tree + path) |
| Horizon effect risk | Yes | No |
| Hyperparameters | Multiple (depth, samples, etc.) | Single (α) |
| Interpretability of tuning | Direct constraints | Abstract penalty |
| Typical use case | Quick experiments, ensembles | Careful single-tree modeling |
| Combined with CV | Yes, tune each parameter | Yes, tune α |
In practice, combining both approaches often works best: use light pre-pruning (reasonable max_depth, min_samples_leaf) to prevent extreme trees, then apply CCP with CV to fine-tune. This reduces the computational cost of CCP while maintaining its global optimization benefits.
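A minimal sketch of that combined recipe is shown below. The specific constraint values and the use of `GridSearchCV` are illustrative choices, not prescriptions from any particular reference.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Light pre-pruning keeps the initial tree (and hence the pruning path) small;
# cross-validated CCP then fine-tunes alpha on top of those constraints.
# The parameter values below are illustrative placeholders.
base = DecisionTreeClassifier(
    max_depth=12,        # loose cap: only prevents pathological depth
    min_samples_leaf=5,  # loose floor: avoids single-sample leaves
    random_state=42,
)

def tune_combined(X, y, cv=5):
    # Candidate alphas come from the path of the lightly constrained tree
    alphas = base.cost_complexity_pruning_path(X, y).ccp_alphas[:-1]
    search = GridSearchCV(base, {'ccp_alpha': list(alphas)}, cv=cv)
    search.fit(X, y)
    return search.best_estimator_
```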
You now understand cost-complexity pruning, the principled approach to decision tree regularization. You can implement the complete CCP workflow and select optimal alpha values using cross-validation. Next, we'll explore specific pre-pruning parameters in detail: minimum samples constraints.