In decision tree learning, the algorithm must determine the optimal way to partition data at each node. The fundamental question is: which feature and at what value should we split to best separate the classes? This problem asks you to implement the core split selection mechanism that powers decision tree construction using information gain based on entropy.
Entropy is a measure of impurity or uncertainty in a dataset. For a set of samples with binary or multi-class labels, entropy quantifies how mixed the classes are. A dataset where all samples belong to the same class has entropy of 0 (perfectly pure), while a dataset with equally distributed classes has maximum entropy (maximum uncertainty).
For a dataset with $K$ classes, the entropy $H(S)$ is calculated as:
$$H(S) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$
where $p_k$ is the proportion of samples belonging to class $k$. By convention, $0 \log_2(0) = 0$.
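The formula translates directly to a few lines of NumPy. This is a minimal sketch; the helper name `entropy` is an assumption, not a prescribed API:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a 1-D label array (illustrative helper)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    # p contains only the proportions of classes that actually occur, so every
    # entry is strictly positive and the 0*log2(0) = 0 convention never needs
    # explicit handling here.
    return float(-np.sum(p * np.log2(p))) + 0.0  # +0.0 normalizes -0.0

print(entropy(np.array([0, 0, 1, 1])))  # maximally mixed → 1.0
print(entropy(np.array([1, 1, 1])))     # pure → 0.0
```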
Information Gain measures how much a split reduces entropy. When we split a parent node $S$ into child nodes $S_L$ (left) and $S_R$ (right), the information gain is:
$$IG(S, \text{split}) = H(S) - \left( \frac{|S_L|}{|S|} H(S_L) + \frac{|S_R|}{|S|} H(S_R) \right)$$
The weighted average of child entropies accounts for the relative sizes of the partitions.
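The weighted average above can be computed for a concrete split as follows; this is a standalone sketch, and the name `information_gain` is an assumption:

```python
import numpy as np

def information_gain(y_parent, y_left, y_right):
    """IG = H(parent) minus the size-weighted average of child entropies."""
    def H(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p))) + 0.0  # +0.0 normalizes -0.0

    n = len(y_parent)
    return H(y_parent) - (len(y_left) / n) * H(y_left) - (len(y_right) / n) * H(y_right)

# Splitting [0, 0, 1, 1] into two pure halves removes all uncertainty:
print(information_gain(np.array([0, 0, 1, 1]),
                       np.array([0, 0]),
                       np.array([1, 1])))  # → 1.0
```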
Implement a function that examines all possible binary splits across all features and identifies the split that maximizes information gain. For each feature:
- Collect the sorted unique values of that feature.
- Generate candidate thresholds as the midpoints between consecutive unique values.
- For each threshold, partition the samples into a left group (feature value ≤ threshold) and a right group (feature value > threshold), and compute the information gain of that split.
Return a tuple containing:
- the index of the best feature,
- the threshold of the best split, and
- the information gain that split achieves, for example (0, 2.5, 1.0).
Handle these scenarios gracefully: a feature with only one unique value yields no candidate thresholds, and a node whose labels are already pure admits no split with positive information gain.
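The search can be sketched as follows. The function name `best_split` and the `(None, None, 0.0)` fallback for unsplittable nodes are illustrative assumptions, not a prescribed API:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search all features and midpoint thresholds,
    returning (feature_index, threshold, information_gain)."""
    def H(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p))) + 0.0  # +0.0 normalizes -0.0

    n_samples, n_features = X.shape
    parent = H(y)
    best = (None, None, 0.0)  # fallback when no split improves purity
    for j in range(n_features):
        values = np.unique(X[:, j])                    # sorted unique values
        thresholds = (values[:-1] + values[1:]) / 2.0  # midpoints; empty if
        for t in thresholds:                           # only one unique value
            mask = X[:, j] <= t
            y_left, y_right = y[mask], y[~mask]
            gain = parent - (len(y_left) * H(y_left)
                             + len(y_right) * H(y_right)) / n_samples
            if gain > best[2]:  # strict '>' keeps the first best split on ties
                best = (j, float(t), gain)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # → (0, 2.5, 1.0)
```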
```python
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
```

Expected output: `(0, 2.5, 1.0)`

This dataset has one feature with values [1, 2, 3, 4] and corresponding labels [0, 0, 1, 1].
Step 1: Calculate Parent Entropy

With two samples of class 0 and two of class 1, $p_0 = p_1 = 0.5$, so $H(\text{parent}) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.0$.
Step 2: Evaluate Candidate Thresholds

Candidate thresholds are the midpoints between consecutive feature values: 1.5, 2.5, 3.5.
Threshold 2.5: the left child receives samples {1.0, 2.0} with labels [0, 0] and the right child receives {3.0, 4.0} with labels [1, 1]. Both children are pure, so $H(S_L) = H(S_R) = 0$ and $IG = 1.0 - 0 = 1.0$.
This threshold achieves the maximum possible information gain of 1.0, perfectly separating the two classes.
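The arithmetic in the two steps above can be checked directly with the standard library (a standalone verification, independent of any particular implementation):

```python
import math

# Step 1: two samples of each class → H(parent) = 1.0
h_parent = -(0.5 * math.log2(0.5) + 0.5 * math.log2(0.5))

# Threshold 2.5: left labels [0, 0], right labels [1, 1]; both pure → H = 0
gain_25 = h_parent - (2 / 4) * 0.0 - (2 / 4) * 0.0
print(gain_25)  # → 1.0

# Threshold 1.5 for contrast: left labels [0], right labels [0, 1, 1]
h_right = -((1 / 3) * math.log2(1 / 3) + (2 / 3) * math.log2(2 / 3))  # ≈ 0.918
gain_15 = h_parent - (1 / 4) * 0.0 - (3 / 4) * h_right                # ≈ 0.311
```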
```python
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 3.0], [4.0, 2.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1, 1])
```

Expected output: `(0, 2.5, 0.971)`

This dataset has two features, so splits must be evaluated on both.
Parent Entropy Calculation: with 2 samples of class 0 and 3 samples of class 1, $H(\text{parent}) = -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) \approx 0.971$.
Feature 0 Analysis (values: [1, 2, 3, 4, 5]): the best threshold is 2.5, splitting the labels into [0, 0] on the left and [1, 1, 1] on the right. Both children are pure, so the information gain equals the parent entropy, IG ≈ 0.971.
Feature 1 Analysis (values: [5, 4, 3, 2, 1]): because these values simply decrease as the labels change, threshold 3.5 produces the same pure partition with the sides swapped, also giving IG ≈ 0.971. The gain ties with Feature 0, and the tie is broken in favor of the feature examined first.
The optimal split is Feature 0 at threshold 2.5 with information gain ≈ 0.971.
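The parent entropy and the tie between the two features can be verified with a few lines of arithmetic (a standalone check, independent of any particular implementation):

```python
import math

# Parent: 2 samples of class 0, 3 samples of class 1
h_parent = -((2 / 5) * math.log2(2 / 5) + (3 / 5) * math.log2(3 / 5))
print(round(h_parent, 3))  # → 0.971

# Feature 0, threshold 2.5: left labels [0, 0], right labels [1, 1, 1];
# both children pure, so the gain equals the parent entropy.
gain_f0 = h_parent - (2 / 5) * 0.0 - (3 / 5) * 0.0

# Feature 1, threshold 3.5: the same pure partition with sides swapped
# (left labels [1, 1, 1], right labels [0, 0]) → identical gain.
gain_f1 = h_parent - (3 / 5) * 0.0 - (2 / 5) * 0.0
print(gain_f0 == gain_f1)  # → True
```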
```python
X = np.array([[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [5.0, 6.0, 7.0], [6.0, 7.0, 8.0]])
y = np.array([0, 0, 1, 1])
```

Expected output: `(0, 3.0, 1.0)`

This 3-feature dataset demonstrates how the algorithm selects among multiple features.
Parent Entropy: With 2 samples each of class 0 and class 1, H(parent) = 1.0.
Feature 0 (values: [0, 1, 5, 6]): candidate thresholds are 0.5, 3.0, and 5.5. Threshold 3.0, the midpoint of the gap between 1 and 5, splits the labels into [0, 0] and [1, 1]; both children are pure, so IG = 1.0 - 0 = 1.0.
Features 1 and 2 follow similar patterns and also achieve perfect splits at their respective gap thresholds. Feature 0 at threshold 3.0 is returned as it's the first perfect split encountered.
The result is (0, 3.0, 1.0), representing feature index 0, split threshold 3.0, and maximum information gain of 1.0.
Constraints