Decision trees are powerful and interpretable machine learning models that recursively partition the feature space into regions that are as homogeneous as possible with respect to the target variable. At the heart of this partitioning process lies the split selection algorithm—the mechanism that determines which feature and threshold combination will best separate the data at each node.
The Gini impurity is one of the most widely used criteria for evaluating the quality of a split in classification trees. For a binary classification problem, the Gini impurity measures how often a randomly chosen element from a set would be incorrectly classified if it were randomly labeled according to the distribution of labels in the subset. A Gini impurity of 0 indicates a perfectly pure node where all samples belong to a single class.
For a node with n samples, where p₀ and p₁ are the proportions of class 0 and class 1 samples respectively, the Gini impurity is calculated as:
$$Gini = 1 - p_0^2 - p_1^2 = 2 \cdot p_0 \cdot p_1$$
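As a concrete illustration, this formula can be computed directly from an array of labels (a minimal sketch; the function name `gini_impurity` is an assumption, not part of the task specification):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a collection of binary class labels (0/1)."""
    # Illustrative sketch; the name `gini_impurity` is an assumption.
    labels = np.asarray(labels)
    n = len(labels)
    if n == 0:
        return 0.0  # treat an empty node as contributing no impurity
    p1 = np.sum(labels == 1) / n
    p0 = 1.0 - p1
    return 1.0 - p0**2 - p1**2  # equivalently 2 * p0 * p1
```

For example, a perfectly mixed node `[0, 0, 1, 1]` yields 0.5, the maximum for binary labels, while a pure node yields 0.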
When evaluating a potential split, we compute the weighted Gini impurity by taking the weighted average of the Gini impurities of the two resulting child nodes, where weights are proportional to the number of samples in each child:
$$Gini_{weighted} = \frac{n_{left}}{n} \cdot Gini_{left} + \frac{n_{right}}{n} \cdot Gini_{right}$$
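The weighted average above translates directly into code (a sketch under the same assumptions; the helper names are illustrative, not prescribed by the task):

```python
import numpy as np

def gini_impurity(labels):
    # Gini impurity of binary labels, using the 2 * p0 * p1 form.
    labels = np.asarray(labels)
    if len(labels) == 0:
        return 0.0
    p1 = np.sum(labels == 1) / len(labels)
    return 2.0 * p1 * (1.0 - p1)

def weighted_gini(left_labels, right_labels):
    """Size-weighted average of the two children's Gini impurities."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) + \
           (len(right_labels) / n) * gini_impurity(right_labels)
```

A perfect split such as `weighted_gini([0, 0], [1, 1])` evaluates to 0, while an uninformative split like `weighted_gini([0, 1], [0, 1])` stays at 0.5.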
Your Task: Implement a function that exhaustively searches through all possible feature-threshold combinations in a dataset and identifies the split that yields the minimum weighted Gini impurity. The function should evaluate every unique value of every feature as a candidate threshold (samples with values ≤ threshold go to the left child, the rest to the right), and return the (feature_index, threshold) pair with the lowest weighted Gini impurity, breaking ties in favor of the first combination encountered.
Example 1

Input:
X = np.array([[2.5], [3.5], [1.0], [4.0]])
y = np.array([0, 1, 0, 1])

Output: (0, 2.5)

Reasoning: With a single feature (column 0), we evaluate all unique values as potential thresholds. Splitting at threshold 2.5 places samples with values ≤ 2.5 (indices 0 and 2, with labels [0, 0]) in the left child and samples with values > 2.5 (indices 1 and 3, with labels [1, 1]) in the right child. Both children are perfectly pure (Gini = 0), resulting in a weighted Gini impurity of 0, the best possible outcome.
Example 2

Input:
X = np.array([[1.0, 5.0], [2.0, 6.0], [5.0, 1.0], [6.0, 2.0]])
y = np.array([0, 0, 1, 1])

Output: (0, 2.0)

Reasoning: This dataset has two features. Scanning feature 0 first, splitting at threshold 2.0 places samples with feature_0 ≤ 2.0 (the first two samples, with labels [0, 0]) in the left child and samples with feature_0 > 2.0 (the last two samples, with labels [1, 1]) in the right child. Both partitions are perfectly pure, achieving the minimum weighted Gini impurity of 0. Feature 0 at threshold 2.0 is returned because it is encountered first.
Example 3

Input:
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [2.0, 3.0, 4.0]])
y = np.array([0, 0, 1, 0])

Output: (0, 4.0)

Reasoning: With three features and four samples, we search for the optimal split. Splitting on feature 0 at threshold 4.0 places samples with feature_0 ≤ 4.0 (samples 0, 1, and 3, with labels [0, 0, 0]) in the left child and the sample with feature_0 > 4.0 (sample 2, with label [1]) in the right child. This creates two pure nodes: the left contains only class 0 and the right only class 1, achieving the minimum possible impurity.
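Putting the pieces together, the exhaustive search described above might be sketched as follows. The function name `find_best_split` and the `(feature_index, threshold)` return convention are assumptions drawn from the worked examples, not a prescribed interface:

```python
import numpy as np

def find_best_split(X, y):
    """Exhaustive search for the split minimizing weighted Gini impurity.

    Sketch only: scans every feature, tries every unique value of that
    feature as a threshold (<= goes left), and returns the first
    (feature_index, threshold) pair achieving the lowest weighted Gini.
    """
    n_samples, n_features = X.shape

    def gini(labels):
        # Binary Gini impurity in the 2 * p0 * p1 form.
        if len(labels) == 0:
            return 0.0
        p1 = np.mean(labels == 1)
        return 2.0 * p1 * (1.0 - p1)

    best = (None, None)
    best_impurity = float("inf")
    for feature in range(n_features):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits that leave a child empty
            w = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if w < best_impurity:  # strict < keeps the first split found on ties
                best_impurity = w
                best = (feature, threshold)
    return best
```

On the first example above, this sketch returns (0, 2.5), matching the expected output.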
Constraints