Decision trees are powerful classifiers, but they often suffer from overfitting when grown too deep. One of the most effective techniques to combat this is post-hoc complexity-based simplification (commonly known as cost-complexity pruning), in which an already-grown tree is strategically simplified by collapsing subtrees into single leaf nodes.
The key insight behind this technique is finding the optimal complexity threshold (commonly denoted as α) at which each internal node should be "pruned" or collapsed. By computing these thresholds for every internal node, we can systematically simplify the tree from its weakest branches upward, achieving the best trade-off between tree complexity and predictive accuracy.
For any internal node in a decision tree, we face a fundamental question: Is the additional complexity of this subtree worth the improved accuracy it provides?
Consider a subtree rooted at an internal node t. The effective alpha threshold for this node is the complexity penalty at which the trade-off breaks even, i.e., the point at which keeping the subtree provides no net benefit over collapsing it into a single leaf.
The effective alpha for an internal node t with subtree T_t is computed as:
$$\alpha_{eff}(t) = \frac{R(t) - R(T_t)}{|T_t| - 1}$$
Where:

- R(t) is the number of misclassifications if node t were collapsed into a single leaf (the node's `errors` value).
- R(T_t) is the total number of misclassifications across all leaves of the subtree T_t rooted at t.
- |T_t| is the number of leaves in the subtree T_t.
This formula captures the "cost" of keeping the subtree in terms of error reduction per additional leaf. A lower alpha means the subtree provides little benefit per unit complexity, making it a candidate for early pruning.
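As a quick numeric sketch of the formula (with made-up quantities): suppose collapsing a node into a leaf would produce 30 errors, while its subtree's two leaves produce 10 + 15 = 25 errors combined.

```python
R_t = 30          # R(t): errors if the node is collapsed to a leaf
R_subtree = 25    # R(T_t): total errors across the subtree's leaves
num_leaves = 2    # |T_t|: number of leaves in the subtree

# alpha_eff(t) = (R(t) - R(T_t)) / (|T_t| - 1)
alpha_eff = (R_t - R_subtree) / (num_leaves - 1)
print(alpha_eff)  # 5.0
```

Keeping this subtree buys 5 fewer errors per extra leaf; any complexity penalty above 5.0 would favor collapsing it.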
The decision tree is represented as nested dictionaries with the following schema:
| Key | Type | Description |
|---|---|---|
| `samples` | integer | Number of training samples reaching this node |
| `errors` | integer | Misclassification count if this node were converted to a leaf |
| `left` | dict or None | Left child subtree (None for leaf nodes) |
| `right` | dict or None | Right child subtree (None for leaf nodes) |

A node is a leaf if both `left` and `right` are `None`.
Implement a function that:

- Traverses the tree and visits every internal node.
- Computes the effective alpha for each internal node using the formula above.
- Returns all effective alphas as a list sorted in ascending order.
The sorted order is significant: the node with the smallest alpha is the "weakest link" and would be the first candidate for pruning in a sequential simplification process.
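One way to approach this is a single post-order traversal that returns each subtree's total leaf errors and leaf count, computing an alpha at every internal node along the way. A minimal sketch, assuming the nested-dict schema above (the name `effective_alphas` and the helper `walk` are illustrative, not prescribed):

```python
def effective_alphas(tree):
    """Return the effective alpha of every internal node, sorted ascending."""
    alphas = []

    def walk(node):
        # Returns (total leaf errors, leaf count) for the subtree at `node`.
        if node['left'] is None and node['right'] is None:
            return node['errors'], 1
        left_err, left_leaves = walk(node['left'])
        right_err, right_leaves = walk(node['right'])
        sub_errors = left_err + right_err
        sub_leaves = left_leaves + right_leaves
        # alpha_eff(t) = (R(t) - R(T_t)) / (|T_t| - 1)
        alphas.append((node['errors'] - sub_errors) / (sub_leaves - 1))
        return sub_errors, sub_leaves

    walk(tree)
    return sorted(alphas)
```

Because the traversal is post-order, each subtree's error total and leaf count are computed exactly once, giving O(n) time over n nodes.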
```python
tree = {
    'samples': 200, 'errors': 80,
    'left': {
        'samples': 120, 'errors': 30,
        'left': {'samples': 70, 'errors': 10, 'left': None, 'right': None},
        'right': {'samples': 50, 'errors': 15, 'left': None, 'right': None}
    },
    'right': {'samples': 80, 'errors': 25, 'left': None, 'right': None}
}
```

Expected output: `[5.0, 15.0]`

This tree has 2 internal nodes (the root and the left child of the root).
Left internal node analysis:

- R(t) = 30, R(T_t) = 10 + 15 = 25, |T_t| = 2
- α = (30 − 25) / (2 − 1) = 5.0

Root node analysis:

- R(t) = 80, R(T_t) = 10 + 15 + 25 = 50, |T_t| = 3
- α = (80 − 50) / (3 − 1) = 15.0
Sorted result: [5.0, 15.0]
The left internal node (α = 5.0) is the "weakest link" and would be pruned first during iterative complexity reduction.
```python
tree = {
    'samples': 100, 'errors': 40,
    'left': {'samples': 60, 'errors': 15, 'left': None, 'right': None},
    'right': {'samples': 40, 'errors': 10, 'left': None, 'right': None}
}
```

Expected output: `[15.0]`

This is a simple tree with only the root as an internal node (both children are leaves).
Root node analysis:

- R(t) = 40, R(T_t) = 15 + 10 = 25, |T_t| = 2
- α = (40 − 25) / (2 − 1) = 15.0
Only one internal node exists, so the result is [15.0].
```python
tree = {
    'samples': 400, 'errors': 120,
    'left': {
        'samples': 200, 'errors': 50,
        'left': {'samples': 100, 'errors': 20, 'left': None, 'right': None},
        'right': {'samples': 100, 'errors': 20, 'left': None, 'right': None}
    },
    'right': {
        'samples': 200, 'errors': 60,
        'left': {'samples': 100, 'errors': 25, 'left': None, 'right': None},
        'right': {'samples': 100, 'errors': 25, 'left': None, 'right': None}
    }
}
```

Expected output: `[10.0, 10.0, 10.0]`

This is a balanced tree with 3 internal nodes (root, left subtree root, right subtree root).
Left internal node:

- R(t) = 50, R(T_t) = 20 + 20 = 40, |T_t| = 2, so α = (50 − 40) / 1 = 10.0

Right internal node:

- R(t) = 60, R(T_t) = 25 + 25 = 50, |T_t| = 2, so α = (60 − 50) / 1 = 10.0

Root node:

- R(t) = 120, R(T_t) = 20 + 20 + 25 + 25 = 90, |T_t| = 4, so α = (120 − 90) / 3 = 10.0
All three internal nodes have identical alpha values, indicating this tree is "balanced" in terms of complexity-accuracy trade-offs at all levels.
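The per-node arithmetic above can be spot-checked directly (variable names here are illustrative):

```python
# Quantities taken from the Example 3 tree above.
left_alpha = (50 - (20 + 20)) / (2 - 1)            # left internal node, 2 leaves
right_alpha = (60 - (25 + 25)) / (2 - 1)           # right internal node, 2 leaves
root_alpha = (120 - (20 + 20 + 25 + 25)) / (4 - 1) # root, 4 leaves
print(sorted([left_alpha, right_alpha, root_alpha]))  # [10.0, 10.0, 10.0]
```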
Constraints