In decision tree learning, the algorithm must determine the optimal way to partition data at each node. The fundamental question is: which feature and at what value should we split to best separate the classes? This problem asks you to implement the core split selection mechanism that powers decision tree construction using information gain based on entropy.
Entropy is a measure of impurity or uncertainty in a dataset. For a set of samples with binary or multi-class labels, entropy quantifies how mixed the classes are. A dataset where all samples belong to the same class has entropy of 0 (perfectly pure), while a dataset with equally distributed classes has maximum entropy (maximum uncertainty).
For a dataset with $K$ classes, the entropy $H(S)$ is calculated as:
$$H(S) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$
where $p_k$ is the proportion of samples belonging to class $k$. By convention, $0 \log_2(0) = 0$.
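The formula translates directly to a few lines of NumPy. This is a minimal sketch; the helper name `entropy` is an assumption, not a prescribed API:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a 1-D label array (illustrative helper)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    # p contains only the proportions of classes that actually occur, so every
    # entry is strictly positive and the 0*log2(0) = 0 convention never needs
    # explicit handling here.
    return float(-np.sum(p * np.log2(p))) + 0.0  # +0.0 normalizes -0.0

print(entropy(np.array([0, 0, 1, 1])))  # maximally mixed → 1.0
print(entropy(np.array([1, 1, 1])))     # pure → 0.0
```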
Information Gain measures how much a split reduces entropy. When we split a parent node $S$ into child nodes $S_L$ (left) and $S_R$ (right), the information gain is:
$$IG(S, \text{split}) = H(S) - \left( \frac{|S_L|}{|S|} H(S_L) + \frac{|S_R|}{|S|} H(S_R) \right)$$
The weighted average of child entropies accounts for the relative sizes of the partitions.
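The weighted average above can be computed for a concrete split as follows; this is a standalone sketch, and the name `information_gain` is an assumption:

```python
import numpy as np

def information_gain(y_parent, y_left, y_right):
    """IG = H(parent) minus the size-weighted average of child entropies."""
    def H(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p))) + 0.0  # +0.0 normalizes -0.0

    n = len(y_parent)
    return H(y_parent) - (len(y_left) / n) * H(y_left) - (len(y_right) / n) * H(y_right)

# Splitting [0, 0, 1, 1] into two pure halves removes all uncertainty:
print(information_gain(np.array([0, 0, 1, 1]),
                       np.array([0, 0]),
                       np.array([1, 1])))  # → 1.0
```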
Implement a function that examines all possible binary splits across all features and identifies the split that maximizes information gain. For each feature:
- Collect the sorted unique values of that feature.
- Generate candidate thresholds as the midpoints between consecutive unique values.
- For each threshold, partition the samples into a left group (feature value ≤ threshold) and a right group (feature value > threshold), and compute the information gain of that split.
Return a tuple containing:
- the index of the best feature,
- the threshold of the best split, and
- the information gain that split achieves, for example (0, 2.5, 1.0).
Handle these scenarios gracefully: a feature with only one unique value yields no candidate thresholds, and a node whose labels are already pure admits no split with positive information gain.
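The search can be sketched as follows. The function name `best_split` and the `(None, None, 0.0)` fallback for unsplittable nodes are illustrative assumptions, not a prescribed API:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search all features and midpoint thresholds,
    returning (feature_index, threshold, information_gain)."""
    def H(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p))) + 0.0  # +0.0 normalizes -0.0

    n_samples, n_features = X.shape
    parent = H(y)
    best = (None, None, 0.0)  # fallback when no split improves purity
    for j in range(n_features):
        values = np.unique(X[:, j])                    # sorted unique values
        thresholds = (values[:-1] + values[1:]) / 2.0  # midpoints; empty if
        for t in thresholds:                           # only one unique value
            mask = X[:, j] <= t
            y_left, y_right = y[mask], y[~mask]
            gain = parent - (len(y_left) * H(y_left)
                             + len(y_right) * H(y_right)) / n_samples
            if gain > best[2]:  # strict '>' keeps the first best split on ties
                best = (j, float(t), gain)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # → (0, 2.5, 1.0)
```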
```python
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
```

Expected output: `(0, 2.5, 1.0)`

This dataset has one feature with values [1, 2, 3, 4] and corresponding labels [0, 0, 1, 1].
Step 1: Calculate Parent Entropy

With two samples of class 0 and two of class 1, $p_0 = p_1 = 0.5$, so $H(\text{parent}) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.0$.
Step 2: Evaluate Candidate Thresholds

Candidate thresholds are the midpoints between consecutive feature values: 1.5, 2.5, 3.5.
Threshold 2.5: the left child receives samples {1.0, 2.0} with labels [0, 0] and the right child receives {3.0, 4.0} with labels [1, 1]. Both children are pure, so $H(S_L) = H(S_R) = 0$ and $IG = 1.0 - 0 = 1.0$.
This threshold achieves the maximum possible information gain of 1.0, perfectly separating the two classes.
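The arithmetic in the two steps above can be checked directly with the standard library (a standalone verification, independent of any particular implementation):

```python
import math

# Step 1: two samples of each class → H(parent) = 1.0
h_parent = -(0.5 * math.log2(0.5) + 0.5 * math.log2(0.5))

# Threshold 2.5: left labels [0, 0], right labels [1, 1]; both pure → H = 0
gain_25 = h_parent - (2 / 4) * 0.0 - (2 / 4) * 0.0
print(gain_25)  # → 1.0

# Threshold 1.5 for contrast: left labels [0], right labels [0, 1, 1]
h_right = -((1 / 3) * math.log2(1 / 3) + (2 / 3) * math.log2(2 / 3))  # ≈ 0.918
gain_15 = h_parent - (1 / 4) * 0.0 - (3 / 4) * h_right                # ≈ 0.311
```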
```python
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 3.0], [4.0, 2.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1, 1])
```

Expected output: `(0, 2.5, 0.971)`

This dataset has two features, so splits must be evaluated on both.
Parent Entropy Calculation: with 2 samples of class 0 and 3 samples of class 1, $H(\text{parent}) = -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) \approx 0.971$.
Feature 0 Analysis (values: [1, 2, 3, 4, 5]): the best threshold is 2.5, splitting the labels into [0, 0] on the left and [1, 1, 1] on the right. Both children are pure, so the information gain equals the parent entropy, IG ≈ 0.971.
Feature 1 Analysis (values: [5, 4, 3, 2, 1]): because these values simply decrease as the labels change, threshold 3.5 produces the same pure partition with the sides swapped, also giving IG ≈ 0.971. The gain ties with Feature 0, and the tie is broken in favor of the feature examined first.
The optimal split is Feature 0 at threshold 2.5 with information gain ≈ 0.971.
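The parent entropy and the tie between the two features can be verified with a few lines of arithmetic (a standalone check, independent of any particular implementation):

```python
import math

# Parent: 2 samples of class 0, 3 samples of class 1
h_parent = -((2 / 5) * math.log2(2 / 5) + (3 / 5) * math.log2(3 / 5))
print(round(h_parent, 3))  # → 0.971

# Feature 0, threshold 2.5: left labels [0, 0], right labels [1, 1, 1];
# both children pure, so the gain equals the parent entropy.
gain_f0 = h_parent - (2 / 5) * 0.0 - (3 / 5) * 0.0

# Feature 1, threshold 3.5: the same pure partition with sides swapped
# (left labels [1, 1, 1], right labels [0, 0]) → identical gain.
gain_f1 = h_parent - (3 / 5) * 0.0 - (2 / 5) * 0.0
print(gain_f0 == gain_f1)  # → True
```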
```python
X = np.array([[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [5.0, 6.0, 7.0], [6.0, 7.0, 8.0]])
y = np.array([0, 0, 1, 1])
```

Expected output: `(0, 3.0, 1.0)`

This 3-feature dataset demonstrates how the algorithm selects among multiple features.
Parent Entropy: With 2 samples each of class 0 and class 1, H(parent) = 1.0.
Feature 0 (values: [0, 1, 5, 6]): candidate thresholds are 0.5, 3.0, and 5.5. Threshold 3.0, the midpoint of the gap between 1 and 5, splits the labels into [0, 0] and [1, 1]; both children are pure, so IG = 1.0 - 0 = 1.0.
Features 1 and 2 follow similar patterns and also achieve perfect splits at their respective gap thresholds. Feature 0 at threshold 3.0 is returned as it's the first perfect split encountered.
The result is (0, 3.0, 1.0), representing feature index 0, split threshold 3.0, and maximum information gain of 1.0.
Constraints