Decision trees possess a remarkable property: given sufficient depth and no constraints, they can perfectly classify any training dataset (provided no two identical examples carry different labels) by creating a unique path for every single training example. While this might sound desirable, it represents one of the most severe forms of overfitting in machine learning.
Consider what happens when a decision tree grows without restraint:

- Training error falls to zero as each leaf shrinks toward a single example
- Leaves capture noise and labeling errors rather than genuine patterns
- Test performance degrades even as training accuracy climbs

This fundamental tension between training accuracy and generalization capability is where pruning enters the picture.
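This gap is easy to observe directly. The sketch below (assuming scikit-learn and a synthetic dataset with deliberate label noise) fits an unconstrained tree and compares training accuracy with cross-validated accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 10% label noise, so a perfect
# training fit is guaranteed to have memorized some noise.
X, y = make_classification(n_samples=500, n_features=10,
                           flip_y=0.10, random_state=0)

# Unconstrained tree: grows until every leaf is pure.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

train_acc = clf.score(X, y)                       # 1.0: perfect memorization
cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # noticeably lower
print(f"train={train_acc:.2f}, cv={cv_acc:.2f}")
```

The training score is perfect while held-out performance is not: the extra structure encodes noise, not signal.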
This page covers pre-pruning (early stopping) comprehensively: the philosophy behind stopping tree growth early, the mathematical basis for various stopping criteria, practical parameter selection, and the tradeoffs between different pre-pruning strategies.
There are two fundamentally different philosophies for controlling decision tree complexity:
Pre-pruning (Early Stopping): Stop growing the tree before it becomes too complex. This approach sets constraints during the tree construction process, preventing the tree from ever reaching full depth.
Post-pruning (Pruning Back): Allow the tree to grow fully, then remove branches that don't improve generalization. This approach builds the complete tree first, then simplifies it.
Each approach has distinct characteristics:
| Aspect | Pre-pruning | Post-pruning |
|---|---|---|
| Philosophy | Prevention: Stop before overfitting | Cure: Fix overfitting after it occurs |
| Computational Cost | Lower: Avoids growing unnecessary branches | Higher: Must build full tree first |
| Risk of Underfitting | Higher: May stop too early | Lower: Has full tree to work with |
| Optimality | Greedy: Local decisions without global view | More global: Can evaluate full structure |
| Implementation | Simpler: Add stopping conditions | Complex: Requires subtree evaluation |
| Interpretability | Direct: Parameters map to tree constraints | Indirect: Pruning strength less intuitive |
Pre-pruning suffers from the 'horizon effect': a split that appears useless might enable highly valuable splits deeper in the tree. By stopping early, we may miss these beneficial patterns. This is the primary theoretical argument for post-pruning, though in practice, pre-pruning often works well with proper parameter tuning.
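The horizon effect can be made concrete with an XOR-style dataset (a sketch assuming scikit-learn). No single split reduces impurity at the root, so any positive `min_impurity_decrease` threshold blocks growth entirely, yet the full tree separates the classes perfectly two levels down:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# XOR dataset: class = x1 XOR x2. Any single split leaves both
# children at a 50/50 class mix, so the root split's impurity
# decrease is exactly zero.
base = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X = np.tile(base, (50, 1))
y = np.tile([0, 1, 1, 0], 50)

# Requiring a positive gain blocks the root split: the tree stays
# a single leaf and predicts one class everywhere.
stopped = DecisionTreeClassifier(min_impurity_decrease=0.01,
                                 random_state=0).fit(X, y)

# With no gain requirement, the zero-gain root split is still made,
# and the splits below it separate the classes perfectly.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

print(stopped.score(X, y))  # 0.5
print(full.score(X, y))     # 1.0
```

The "useless" root split is precisely what enables the valuable splits beneath it — the pattern the pre-pruned tree can never see.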
The maximum depth parameter is the most intuitive and widely-used pre-pruning strategy. It directly limits how many sequential decisions the tree can make from root to any leaf.
Mathematical Definition:
For a node $v$ at depth $d(v)$, where the root has $d(\text{root}) = 0$:
$$\text{Split node } v \text{ only if } d(v) < d_{\max}$$
Depth and Model Complexity:
The relationship between depth and tree capacity is exponential: a binary tree of depth $d$ can have up to $2^d$ leaves, and therefore up to $2^d$ distinct prediction regions. This exponential relationship means small changes in max_depth have dramatic effects on model complexity.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def analyze_max_depth(X, y, depth_range=range(1, 21)):
    """
    Analyze the effect of max_depth on train/test performance.

    This function demonstrates the bias-variance tradeoff
    as max_depth increases.
    """
    train_scores = []
    cv_scores = []
    cv_stds = []

    for depth in depth_range:
        clf = DecisionTreeClassifier(
            max_depth=depth,
            random_state=42
        )

        # Training accuracy (measure of fit)
        clf.fit(X, y)
        train_scores.append(clf.score(X, y))

        # Cross-validation accuracy (measure of generalization)
        cv_result = cross_val_score(clf, X, y, cv=5)
        cv_scores.append(cv_result.mean())
        cv_stds.append(cv_result.std())

    return {
        'depths': list(depth_range),
        'train_scores': train_scores,
        'cv_scores': cv_scores,
        'cv_stds': cv_stds,
        'optimal_depth': depth_range[np.argmax(cv_scores)]
    }

# Example usage pattern:
# results = analyze_max_depth(X_train, y_train)
# print(f"Optimal depth: {results['optimal_depth']}")
```

For most classification problems: start with max_depth=3-5 for interpretable models, 5-10 for balanced performance, and 10-20 for maximum accuracy. Depths beyond 20 rarely improve performance and often indicate overfitting. For regression trees, slightly deeper trees (8-15) are often beneficial due to the continuous nature of targets.
The minimum samples split parameter requires a minimum number of training examples in a node before considering a split. This prevents the tree from making decisions based on insufficient evidence.
Mathematical Formulation:
For a node $v$ containing $n_v$ samples:
$$\text{Consider splitting } v \text{ only if } n_v \geq n_{\text{min-split}}$$
Statistical Motivation:
The statistical rationale is compelling: splits based on small sample sizes have high variance. A split chosen to separate a handful of samples may look informative purely by chance, and the same decision rule is unlikely to hold on new data. With more samples behind each decision, the chosen threshold reflects a genuine pattern rather than sampling noise.
Practical Impact:
Setting min_samples_split=20 means:

- Any node containing fewer than 20 samples automatically becomes a leaf
- Every internal decision in the tree is supported by at least 20 training examples
- Sparse regions of feature space are summarized rather than subdivided
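A short sketch (assuming scikit-learn and a synthetic dataset) shows how raising min_samples_split shrinks the tree:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           flip_y=0.05, random_state=0)

# Larger min_samples_split -> fewer places the tree may split
# -> fewer leaves and a shallower tree.
for n_min in [2, 20, 100]:
    clf = DecisionTreeClassifier(min_samples_split=n_min,
                                 random_state=0).fit(X, y)
    print(n_min, clf.get_n_leaves(), clf.get_depth())
```

With the default of 2 the tree grows until every leaf is pure; at 100, whole regions that would have been subdivided are summarized by a single leaf.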
The minimum samples leaf parameter ensures every leaf node contains at least a specified number of training examples. Unlike min_samples_split which constrains when to consider splitting, this parameter constrains the outcome of splits.
Mathematical Formulation:
For any split of node $v$ into children $v_L$ and $v_R$:
$$\text{Allow split only if } n_{v_L} \geq n_{\text{min-leaf}} \text{ AND } n_{v_R} \geq n_{\text{min-leaf}}$$
Why This Matters:
Consider a node with 100 samples. With min_samples_split=10, we can split it. But if the best split produces children of sizes 95 and 5, and min_samples_leaf=10, this split is rejected despite passing the split threshold.
Smoothing Effect:
Minimum leaf samples creates smoother decision boundaries: each leaf's prediction is a majority vote (or average, in regression) over at least $k$ training examples, so no single noisy point can dominate a prediction, and predictions change less abruptly across adjacent regions of feature space.
min_samples_leaf provides a stronger constraint than min_samples_split. If min_samples_leaf=k, then effectively min_samples_split must be at least 2k (since both children need k samples). Many practitioners find min_samples_leaf more intuitive because it directly controls leaf node reliability.
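One way to see the constraint in action (a sketch that inspects scikit-learn's fitted `tree_` structure) is to check leaf populations after training:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

# Leaves are the nodes with no children (children_left == -1).
tree = clf.tree_
leaf_mask = tree.children_left == -1
leaf_sizes = tree.n_node_samples[leaf_mask]
print(leaf_sizes.min())  # guaranteed >= 10
```

Every leaf respects the floor of 10 samples, which is exactly the reliability guarantee that makes this parameter intuitive.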
The minimum impurity decrease parameter requires that each split must reduce impurity by at least a specified threshold. This directly controls the quality-vs-complexity tradeoff.
Mathematical Formulation:
For a split of node $v$ with impurity $I(v)$ into children $v_L$ and $v_R$:
$$\Delta I = I(v) - \left(\frac{n_{v_L}}{n_v} I(v_L) + \frac{n_{v_R}}{n_v} I(v_R)\right)$$
$$\text{Allow split only if } \Delta I \geq \Delta I_{\min}$$
Weighted Information Gain:
The weighted impurity decrease (as used in scikit-learn) accounts for the fraction of total samples:
$$\Delta I_{\text{weighted}} = \frac{n_v}{N} \cdot \Delta I$$
where $N$ is the total number of training samples.
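The two formulas can be checked by hand. The sketch below walks through one hypothetical split (the sample counts are made up for illustration):

```python
def gini(counts):
    """Gini impurity of a node from its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Hypothetical node: 100 samples (50/50), split into a left child
# of 60 samples (40/20) and a right child of 40 samples (10/30).
N = 200                      # total training samples
n_v, n_L, n_R = 100, 60, 40

I_v = gini([50, 50])         # 0.5
I_L = gini([40, 20])         # ~0.444
I_R = gini([10, 30])         # 0.375

# Plain impurity decrease: Delta I
delta = I_v - (n_L / n_v * I_L + n_R / n_v * I_R)

# Weighted version used by scikit-learn: (n_v / N) * Delta I
weighted = (n_v / N) * delta

print(round(delta, 4), round(weighted, 4))  # 0.0833 0.0417
```

Note that the weighting halves the effective decrease here because the node holds only half of the training data, so a fixed threshold prunes deep, small nodes more aggressively than shallow, large ones.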
```python
from sklearn.tree import DecisionTreeClassifier

def demonstrate_impurity_decrease(X, y):
    """
    Demonstrate the effect of min_impurity_decrease.

    Higher values create simpler trees by requiring
    each split to be more 'worthwhile'.
    """
    thresholds = [0.0, 0.001, 0.01, 0.05, 0.1]
    results = []

    for threshold in thresholds:
        clf = DecisionTreeClassifier(
            min_impurity_decrease=threshold,
            random_state=42
        )
        clf.fit(X, y)

        results.append({
            'threshold': threshold,
            'n_leaves': clf.get_n_leaves(),
            'max_depth': clf.get_depth(),
            'train_accuracy': clf.score(X, y)
        })

    return results

# Typical output pattern:
# threshold=0.0:  ~100 leaves, depth=15, acc=1.00
# threshold=0.01: ~30 leaves,  depth=8,  acc=0.95
# threshold=0.1:  ~5 leaves,   depth=3,  acc=0.82
```

For Gini impurity (range 0-0.5 in binary problems) and entropy (range 0-1 in binary problems), typical min_impurity_decrease values are 0.001-0.01 for mild regularization, 0.01-0.05 for moderate regularization, and 0.05-0.1 for aggressive pruning. Values above 0.1 often result in underfitting.
The max_features parameter limits how many features are considered at each split. While primarily used in Random Forests for decorrelation, it also serves as a regularization technique for single trees.
Options:

- `'sqrt'`: consider $\sqrt{p}$ features at each split (common for classification; the older `'auto'` alias behaved the same way and is deprecated in recent scikit-learn versions)
- `'log2'`: consider $\log_2(p)$ features
- `None`: consider all $p$ features

Regularization Effect:
By not always selecting the globally best split, max_features < p introduces randomness that:

- Reduces the variance of the fitted model
- Prevents a few dominant features from appearing in every split
- Decorrelates the trees when many are combined in an ensemble
For single decision trees, max_features is rarely beneficial—you typically want the best split at each node. The power of max_features emerges in Random Forests, where the induced randomness enables ensemble diversity. For standalone trees, use max_features=None unless you have specific regularization needs.
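The induced randomness is easy to observe (a sketch on synthetic data, scikit-learn assumed): with max_features='sqrt', the random seed changes which features are candidates at each split, so different seeds can yield different trees, which is exactly what ensembles exploit.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# With max_features='sqrt', only 4 of the 16 features are candidates
# at each split, sampled at random -> the seed shapes the tree.
trees = [DecisionTreeClassifier(max_features='sqrt',
                                random_state=seed).fit(X, y)
         for seed in range(5)]
print({int(t.tree_.feature[0]) for t in trees})  # root features may differ

# With max_features=None, every split sees all features and, given
# the same data, construction is deterministic.
full = DecisionTreeClassifier(max_features=None, random_state=0).fit(X, y)
print(full.get_n_leaves())
```

Note that even the restricted trees still fit the training data perfectly here; max_features changes *which* splits are chosen, not how far growth proceeds.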
The max_leaf_nodes parameter directly limits the number of terminal nodes (leaves) in the tree. This provides the most direct control over model complexity.
Best-First Tree Growing:
When max_leaf_nodes is set, the tree growing algorithm changes fundamentally: instead of expanding nodes depth-first, the builder keeps a priority queue of candidate nodes and always splits the one offering the largest impurity decrease, stopping once the leaf budget is exhausted.
Best-first ensures that with a limited leaf budget, we allocate leaves to the most valuable splits.
Complexity Interpretation:
Unlike depth, leaf count maps directly to prediction granularity: a tree with $k$ leaves partitions the feature space into exactly $k$ regions, and every prediction is one of at most $k$ distinct values.
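A small sketch (scikit-learn, synthetic data) confirms the direct relationship between the parameter and the number of prediction regions:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for budget in [4, 16, 64]:
    clf = DecisionTreeClassifier(max_leaf_nodes=budget,
                                 random_state=0).fit(X, y)
    # Best-first growth spends the leaf budget on the most valuable
    # splits; the leaf count never exceeds the budget.
    print(budget, clf.get_n_leaves())
```

Small budgets are used in full, while large budgets may go unspent if the tree reaches purity first — the parameter is a ceiling, not a target.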
Pre-pruning provides a computationally efficient approach to controlling decision tree complexity. By setting constraints during tree construction, we avoid the overhead of building and then removing unnecessary structure.
| Parameter | Controls | Typical Range | Primary Use Case |
|---|---|---|---|
| max_depth | Sequential decisions | 3-20 | Overall complexity cap |
| min_samples_split | When to consider splitting | 2-100 or 1-5% | Statistical reliability |
| min_samples_leaf | Minimum leaf population | 1-50 or 0.5-2% | Prediction smoothness |
| min_impurity_decrease | Required split quality | 0.0-0.1 | Marginal split prevention |
| max_leaf_nodes | Total prediction regions | 5-100 | Direct complexity control |
| max_features | Features per split | sqrt(p) or all | Feature regularization |
You now understand pre-pruning strategies for decision trees. These early stopping techniques provide efficient control over tree complexity during construction. Next, we'll explore post-pruning (cost-complexity pruning), which takes the opposite approach: grow first, then simplify.