At the core of every decision tree lies a deceptively simple question: How do we decide which feature to split on? The answer requires a formal measure of 'purity'—a way to quantify how mixed or homogeneous a set of examples is with respect to their class labels.
Gini impurity is one of the most widely used such measures. Named after the Italian statistician Corrado Gini (who introduced the Gini coefficient for income inequality in 1912), Gini impurity has become the default splitting criterion in CART (Classification and Regression Trees) and many of the most popular decision tree implementations, including scikit-learn.
But Gini impurity isn't just a convenient formula—it embodies deep principles about classification, probability, and optimal decision-making. Understanding it thoroughly will not only help you build better trees but also illuminate fundamental concepts in machine learning.
By the end of this page, you will understand Gini impurity from multiple perspectives: its intuitive meaning, mathematical definition, geometric interpretation, computational properties, and behavior in practice. You'll be able to derive it, compute it efficiently, and reason about when and why it works well.
Before diving into formulas, let's build intuition. Imagine you have a basket of fruits and you want to characterize how 'mixed' it is.
Scenario 1: The basket contains 100 apples and 0 oranges.
Scenario 2: The basket contains 50 apples and 50 oranges.
Scenario 3: The basket contains 80 apples and 20 oranges.
We need a function that returns its minimum (zero) for Scenario 1's perfectly pure basket, its maximum for Scenario 2's evenly mixed basket, and an intermediate value for the partially mixed basket of Scenario 3.
Gini impurity is precisely such a function.
Here's a powerful intuition: Gini impurity measures the probability that a randomly chosen element would be incorrectly classified if we assigned labels based on the class distribution. In other words, it quantifies expected classification error under random labeling.
Let's formalize Gini impurity mathematically. Consider a node in a decision tree containing $n$ examples belonging to $K$ classes. Let $p_k$ denote the proportion of examples belonging to class $k$:
$$p_k = \frac{n_k}{n}$$
where $n_k$ is the number of examples of class $k$ and $\sum_{k=1}^{K} p_k = 1$.
The Gini impurity is defined as:
$$G = \sum_{k=1}^{K} p_k (1 - p_k) = \sum_{k=1}^{K} p_k - \sum_{k=1}^{K} p_k^2 = 1 - \sum_{k=1}^{K} p_k^2$$
The three equivalent forms above reveal different aspects of Gini impurity:
Form 1: $\sum_{k=1}^{K} p_k (1 - p_k)$
Form 2: $1 - \sum_{k=1}^{K} p_k^2$
Form 3 (for binary classification): $G = 2p(1-p) = 1 - p^2 - (1-p)^2$, where $p$ is the proportion of one class.
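Applying these forms to the fruit baskets from the opening scenarios gives:

$$G_{100/0} = 1 - (1^2 + 0^2) = 0, \qquad G_{50/50} = 1 - (0.5^2 + 0.5^2) = 0.5, \qquad G_{80/20} = 1 - (0.8^2 + 0.2^2) = 0.32$$

The pure basket scores zero, the evenly mixed basket reaches the binary maximum of 0.5, and the 80/20 basket lands in between, exactly the ordering our intuition demanded.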
The term $\sum_{k=1}^{K} p_k^2$ is the collision probability (known in ecology as Simpson's index). It measures the chance that two randomly selected items belong to the same class. Gini impurity equals one minus the collision probability, so it measures heterogeneity.
```python
import numpy as np
from typing import Union


def gini_impurity(class_counts: Union[list, np.ndarray]) -> float:
    """
    Compute Gini impurity from class counts.

    The Gini impurity measures the probability that a randomly chosen
    element would be incorrectly classified under random labeling based
    on the class distribution.

    Args:
        class_counts: Array or list of counts for each class

    Returns:
        Gini impurity value in range [0, 1 - 1/K]

    Mathematical Definition:
        G = 1 - Σ(p_k²) where p_k = n_k / n

    Properties:
        - G = 0 for perfectly pure nodes (all samples same class)
        - G = 1 - 1/K for uniform distribution over K classes
        - For binary: maximum is 0.5 when p = 0.5

    Example:
        >>> gini_impurity([50, 50])      # Binary, balanced
        0.5
        >>> gini_impurity([100, 0])      # Pure node
        0.0
        >>> gini_impurity([30, 30, 30])  # 3-class, uniform
        0.6666666666666667
    """
    counts = np.array(class_counts, dtype=np.float64)
    n = counts.sum()

    if n == 0:
        return 0.0  # Empty node has no impurity

    # Compute class probabilities
    p = counts / n

    # Gini = 1 - Σ(p_k²)
    return 1.0 - np.sum(p ** 2)


def gini_impurity_binary(p: float) -> float:
    """
    Compute Gini impurity for binary classification.

    For two classes with proportions p and (1-p):
        G = 2p(1-p) = 1 - p² - (1-p)²

    Args:
        p: Proportion of positive class (0 ≤ p ≤ 1)

    Returns:
        Gini impurity value in range [0, 0.5]

    Note:
        Maximum of 0.5 occurs at p = 0.5 (most uncertain)
        Minimum of 0.0 occurs at p = 0 or p = 1 (pure)
    """
    return 2 * p * (1 - p)


# Demonstration
if __name__ == "__main__":
    print("Gini Impurity Examples:")
    print("=" * 50)

    # Pure node
    print(f"Pure [100, 0]: G = {gini_impurity([100, 0]):.4f}")

    # Balanced binary
    print(f"Balanced [50, 50]: G = {gini_impurity([50, 50]):.4f}")

    # Imbalanced
    print(f"[80, 20]: G = {gini_impurity([80, 20]):.4f}")
    print(f"[90, 10]: G = {gini_impurity([90, 10]):.4f}")
    print(f"[99, 1]: G = {gini_impurity([99, 1]):.4f}")

    # Multi-class
    print(f"\n3-class uniform [33, 33, 34]: G = {gini_impurity([33, 33, 34]):.4f}")
    print(f"3-class max = 1 - 1/3 = {1 - 1/3:.4f}")

    # Verify binary formula
    print(f"\nBinary formula verification:")
    for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
        print(f"  p={p:.1f}: G = {gini_impurity_binary(p):.4f}")
```

Understanding the mathematical properties of Gini impurity is essential for reasoning about decision tree behavior. Let's examine these rigorously.
| Classes (K) | Minimum Gini | Maximum Gini | At Maximum Distribution |
|---|---|---|---|
| 2 | 0.0 | 0.5 | [0.5, 0.5] |
| 3 | 0.0 | 0.667 | [0.333, 0.333, 0.333] |
| 4 | 0.0 | 0.75 | [0.25, 0.25, 0.25, 0.25] |
| 5 | 0.0 | 0.8 | [0.2, 0.2, 0.2, 0.2, 0.2] |
| K | 0.0 | 1 - 1/K | [1/K, 1/K, ..., 1/K] |
For many classes, the maximum Gini approaches 1. This means multi-class problems have a larger impurity 'range' to work with. However, achieving meaningful reductions becomes harder because the probability mass is spread thin.
Proof of Maximum:
We want to maximize $G = 1 - \sum_{k=1}^K p_k^2$ subject to $\sum_{k=1}^K p_k = 1$.
This is equivalent to minimizing $\sum p_k^2$ subject to the constraint. Using Lagrange multipliers:
$$\mathcal{L} = \sum p_k^2 + \lambda(\sum p_k - 1)$$
Taking derivatives: $\frac{\partial \mathcal{L}}{\partial p_k} = 2p_k + \lambda = 0$
This gives $p_k = -\lambda/2$ for all $k$. Since probabilities must be equal and sum to 1:
$$p_k = \frac{1}{K} \text{ for all } k$$
Substituting: $G_{max} = 1 - K \cdot (1/K)^2 = 1 - 1/K$ ∎
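A quick numeric spot-check of this maximum, written as a minimal sketch independent of the `gini_impurity` helper defined earlier:

```python
import numpy as np

# Check that the uniform distribution attains G = 1 - 1/K for several K
for K in range(2, 7):
    p = np.full(K, 1.0 / K)          # uniform distribution over K classes
    gini = 1.0 - np.sum(p ** 2)      # G = 1 - Σ p_k²
    print(f"K={K}: Gini(uniform) = {gini:.4f}   1 - 1/K = {1 - 1/K:.4f}")
```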
Gini impurity has beautiful geometric interpretations that provide additional insight.
Binary Classification Geometry:
For two classes with probabilities $p$ and $(1-p)$: $$G = 2p(1-p)$$
This is a downward-opening parabola with roots at $p = 0$ and $p = 1$ (the pure nodes) and a vertex at $p = 0.5$, where the impurity reaches its maximum of $0.5$.
The parabola $y = 2p(1-p)$ is symmetric about $p = 0.5$, reflecting that swapping class labels doesn't change impurity.
Simplex Geometry (Multi-class):
For $K$ classes, the probability distributions live on the $(K-1)$-dimensional probability simplex: $$\Delta^{K-1} = \{(p_1, \ldots, p_K) : p_k \geq 0, \; \textstyle\sum_k p_k = 1\}$$
Gini impurity can be rewritten as: $$G = 1 - ||\mathbf{p}||^2$$
where $||\mathbf{p}||^2 = \sum p_k^2$ is the squared Euclidean norm.
This means Gini impurity shrinks as the probability vector's norm grows: it is zero at the simplex vertices (where a single $p_k = 1$, so $||\mathbf{p}|| = 1$) and maximal at the centroid $p_k = 1/K$, where the norm is smallest.
Geometrically, Gini impurity measures how far the probability vector is from the vertices (corners) of the simplex. The corners represent pure distributions, and Gini increases as you move toward the center.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


def visualize_gini_binary():
    """
    Visualize Gini impurity for binary classification.

    Shows the characteristic parabola G = 2p(1-p).
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Gini parabola
    p = np.linspace(0, 1, 100)
    gini = 2 * p * (1 - p)

    ax1 = axes[0]
    ax1.plot(p, gini, 'b-', linewidth=2.5, label='Gini = 2p(1-p)')
    ax1.fill_between(p, gini, alpha=0.2)

    # Mark key points
    ax1.scatter([0, 0.5, 1], [0, 0.5, 0], c='red', s=100, zorder=5)
    ax1.annotate('Pure\n(class 0)', (0, 0), xytext=(0.05, 0.1),
                 fontsize=10, ha='left')
    ax1.annotate('Maximum\nImpurity', (0.5, 0.5), xytext=(0.52, 0.45),
                 fontsize=10, ha='left')
    ax1.annotate('Pure\n(class 1)', (1, 0), xytext=(0.85, 0.1),
                 fontsize=10, ha='left')

    ax1.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax1.set_ylabel('Gini Impurity', fontsize=12)
    ax1.set_title('Binary Gini Impurity: A Downward Parabola', fontsize=14)
    ax1.set_xlim(-0.05, 1.05)
    ax1.set_ylim(-0.05, 0.6)
    ax1.grid(True, alpha=0.3)
    ax1.legend(loc='upper right')

    # Right: Gradient showing rate of change
    ax2 = axes[1]
    gradient = 2 * (1 - 2*p)  # d/dp of 2p(1-p)
    ax2.plot(p, gradient, 'g-', linewidth=2.5, label="dG/dp = 2(1-2p)")
    ax2.axhline(y=0, color='black', linestyle='--', alpha=0.5)
    ax2.axvline(x=0.5, color='red', linestyle='--', alpha=0.5, label='p=0.5')
    ax2.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax2.set_ylabel('Rate of Change of Gini', fontsize=12)
    ax2.set_title('Gini Gradient: Steepest Near Purity', fontsize=14)
    ax2.grid(True, alpha=0.3)
    ax2.legend()

    plt.tight_layout()
    plt.savefig('gini_binary_geometry.png', dpi=150, bbox_inches='tight')
    plt.show()


def visualize_gini_ternary():
    """
    Visualize Gini impurity on the 3-class probability simplex.

    Creates a triangular heatmap with Gini contours.
    """
    # Generate points in the 2-simplex (triangle)
    resolution = 100
    points = []
    gini_values = []

    for i in range(resolution + 1):
        for j in range(resolution + 1 - i):
            k = resolution - i - j
            p1, p2, p3 = i/resolution, j/resolution, k/resolution
            points.append([p1, p2, p3])
            # Gini = 1 - sum(p^2)
            gini = 1 - (p1**2 + p2**2 + p3**2)
            gini_values.append(gini)

    points = np.array(points)
    gini_values = np.array(gini_values)

    # Transform to 2D coordinates for the triangle
    # Use barycentric to Cartesian transformation
    x = points[:, 1] + 0.5 * points[:, 2]
    y = (np.sqrt(3) / 2) * points[:, 2]

    fig, ax = plt.subplots(figsize=(10, 9))

    # Create triangular scatter plot with Gini coloring
    scatter = ax.scatter(x, y, c=gini_values, cmap='viridis', s=20, alpha=0.8)

    # Draw simplex boundary
    triangle = plt.Polygon([[0, 0], [1, 0], [0.5, np.sqrt(3)/2]],
                           fill=False, edgecolor='black', linewidth=2)
    ax.add_patch(triangle)

    # Label vertices
    ax.annotate('Class 1\n(Pure, G=0)', (0, 0), xytext=(-0.15, -0.08),
                fontsize=11, ha='center', fontweight='bold')
    ax.annotate('Class 2\n(Pure, G=0)', (1, 0), xytext=(1.15, -0.08),
                fontsize=11, ha='center', fontweight='bold')
    ax.annotate('Class 3\n(Pure, G=0)', (0.5, np.sqrt(3)/2),
                xytext=(0.5, np.sqrt(3)/2 + 0.1),
                fontsize=11, ha='center', fontweight='bold')

    # Mark center (max impurity)
    ax.scatter([0.5], [np.sqrt(3)/6], c='red', s=200, marker='*',
               edgecolors='white', linewidths=2, zorder=5)
    ax.annotate('Center\n(G = 0.667)', (0.5, np.sqrt(3)/6),
                xytext=(0.65, 0.35), fontsize=10, ha='left',
                arrowprops=dict(arrowstyle='->', color='red'))

    cbar = plt.colorbar(scatter, ax=ax, shrink=0.6)
    cbar.set_label('Gini Impurity', fontsize=12)

    ax.set_xlim(-0.2, 1.2)
    ax.set_ylim(-0.2, 1.1)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title('Gini Impurity on the 3-Class Probability Simplex',
                 fontsize=14, pad=20)

    plt.savefig('gini_simplex.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    visualize_gini_binary()
    visualize_gini_ternary()
```

The most profound interpretation of Gini impurity is probabilistic. It represents the expected error rate under random labeling.
The Random Labeling Experiment: pick an item uniformly at random from the node, then assign it a label drawn at random from the node's class distribution $(p_1, \ldots, p_K)$. The probability of assigning the wrong label is:
$$P(\text{error}) = \sum_{k=1}^K p_k \cdot P(\text{wrong label} | \text{class } k) = \sum_{k=1}^K p_k \cdot (1 - p_k) = G$$
This is exactly Gini impurity!
Alternative Derivation (Two-Sample Collision):
Draw two samples independently with replacement from the node. They land in the same class with probability $\sum_{k} p_k^2$ and in different classes with probability $1 - \sum_{k} p_k^2 = G$.
Gini impurity is the probability that two randomly chosen items belong to different classes.
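A small Monte Carlo sketch can confirm this (the helper name `simulate_mismatch_rate` is ours, purely for illustration): drawing pairs of samples with replacement and counting how often they disagree should approximate $1 - \sum_k p_k^2$.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_mismatch_rate(p: np.ndarray, n_draws: int = 200_000) -> float:
    """Estimate the probability that two independent draws from the
    class distribution p fall in different classes."""
    a = rng.choice(len(p), size=n_draws, p=p)
    b = rng.choice(len(p), size=n_draws, p=p)
    return float(np.mean(a != b))


for dist in ([0.5, 0.5], [0.8, 0.2], [1/3, 1/3, 1/3]):
    p = np.asarray(dist)
    exact = 1.0 - np.sum(p ** 2)  # Gini impurity
    print(f"p={dist}: simulated={simulate_mismatch_rate(p):.4f}  Gini={exact:.4f}")
```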
In ecology, $1 - \sum_k p_k^2$ is known as the Gini-Simpson diversity index, a measure of species diversity. In economics, $\sum_k p_k^2$ appears as the Herfindahl-Hirschman Index of market concentration. Gini impurity is thus one instance of a widely used measure of heterogeneity.
Why This Matters for Decision Trees:
The expected misclassification interpretation explains why minimizing Gini impurity leads to good classification:
Pure nodes have zero expected error — If all items are the same class, random labeling based on the distribution assigns the correct label with probability 1.
Splits that reduce Gini reduce expected error — By separating classes into different branches, we reduce the 'confusion' in each child node.
Weighted average matters — When evaluating a split, we compute the size-weighted Gini of the children, which is the expected random-labeling error for a sample routed through that split.
This probabilistic foundation gives Gini impurity a direct connection to classification performance, making it a principled choice for tree construction.
In decision trees, we don't just compute Gini impurity—we use it to evaluate potential splits. The Gini gain (or reduction in Gini impurity) measures how much a split improves node purity.
Gini Gain Definition:
For a parent node $S$ with Gini impurity $G(S)$, consider a binary split into children $S_L$ and $S_R$ with sizes $n_L$ and $n_R$:
$$\text{Gini Gain} = G(S) - \left( \frac{n_L}{n} G(S_L) + \frac{n_R}{n} G(S_R) \right)$$
The term in parentheses is the weighted average Gini of the children. Gini gain is always non-negative (by concavity of Gini), and larger gain means a better split.
The Splitting Algorithm: for each feature, sort the samples by feature value, sweep the candidate thresholds while updating class counts, and keep the split with the largest Gini gain (equivalently, the smallest weighted child Gini). The implementation below walks through this procedure.
```python
import numpy as np
from typing import Tuple, List


def gini_impurity(class_counts: np.ndarray) -> float:
    """Compute Gini impurity from class counts."""
    n = class_counts.sum()
    if n == 0:
        return 0.0
    p = class_counts / n
    return 1.0 - np.sum(p ** 2)


def weighted_gini(left_counts: np.ndarray, right_counts: np.ndarray) -> float:
    """
    Compute weighted average Gini impurity for a split.

    This is what we minimize when selecting splits:
        weighted_G = (n_L/n) * G(left) + (n_R/n) * G(right)
    """
    n_left = left_counts.sum()
    n_right = right_counts.sum()
    n_total = n_left + n_right

    if n_total == 0:
        return 0.0

    weight_left = n_left / n_total
    weight_right = n_right / n_total

    return weight_left * gini_impurity(left_counts) + \
           weight_right * gini_impurity(right_counts)


def gini_gain(parent_counts: np.ndarray,
              left_counts: np.ndarray,
              right_counts: np.ndarray) -> float:
    """
    Compute Gini gain for a split.

    Gini Gain = G(parent) - weighted_G(children)

    Larger values indicate better splits that more effectively
    separate classes.
    """
    parent_gini = gini_impurity(parent_counts)
    children_weighted_gini = weighted_gini(left_counts, right_counts)
    return parent_gini - children_weighted_gini


def find_best_split(X: np.ndarray, y: np.ndarray,
                    feature_idx: int) -> Tuple[float, float, float]:
    """
    Find the best split threshold for a single feature.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Labels (n_samples,)
        feature_idx: Index of feature to split on

    Returns:
        (best_threshold, best_gini_gain, weighted_gini_after_split)
    """
    n_samples = len(y)
    n_classes = len(np.unique(y))

    # Get feature values and sort
    feature_values = X[:, feature_idx]
    sorted_indices = np.argsort(feature_values)
    sorted_features = feature_values[sorted_indices]
    sorted_labels = y[sorted_indices]

    # Parent statistics
    parent_counts = np.bincount(y, minlength=n_classes)
    parent_gini = gini_impurity(parent_counts)

    best_gain = -np.inf
    best_threshold = None
    best_weighted_gini = None

    # Initialize left as empty, right as all
    left_counts = np.zeros(n_classes, dtype=np.int64)
    right_counts = parent_counts.copy()

    # Scan through all possible split points
    for i in range(n_samples - 1):
        # Move sample i from right to left
        c = sorted_labels[i]
        left_counts[c] += 1
        right_counts[c] -= 1

        # Skip if no actual threshold here (same value)
        if sorted_features[i] == sorted_features[i + 1]:
            continue

        # Compute gain for this split
        w_gini = weighted_gini(left_counts, right_counts)
        gain = parent_gini - w_gini

        if gain > best_gain:
            best_gain = gain
            best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2
            best_weighted_gini = w_gini

    return best_threshold, best_gain, best_weighted_gini


# Demonstration
if __name__ == "__main__":
    print("Gini Split Demonstration")
    print("=" * 60)

    # Example: Parent node with two classes
    parent = np.array([50, 50])  # Balanced binary
    print(f"\nParent: {parent}")
    print(f"Parent Gini: {gini_impurity(parent):.4f}")

    # Perfect split
    left_perfect = np.array([50, 0])
    right_perfect = np.array([0, 50])
    print(f"\nPerfect split:")
    print(f"  Left {left_perfect}: G = {gini_impurity(left_perfect):.4f}")
    print(f"  Right {right_perfect}: G = {gini_impurity(right_perfect):.4f}")
    print(f"  Weighted Gini: {weighted_gini(left_perfect, right_perfect):.4f}")
    print(f"  Gini Gain: {gini_gain(parent, left_perfect, right_perfect):.4f}")

    # Imperfect split
    left_imperfect = np.array([40, 10])
    right_imperfect = np.array([10, 40])
    print(f"\nImperfect split:")
    print(f"  Left {left_imperfect}: G = {gini_impurity(left_imperfect):.4f}")
    print(f"  Right {right_imperfect}: G = {gini_impurity(right_imperfect):.4f}")
    print(f"  Weighted Gini: {weighted_gini(left_imperfect, right_imperfect):.4f}")
    print(f"  Gini Gain: {gini_gain(parent, left_imperfect, right_imperfect):.4f}")

    # Uninformative split (children mirror the parent distribution)
    left_bad = np.array([25, 25])
    right_bad = np.array([25, 25])
    print(f"\nNo-change split:")
    print(f"  Left {left_bad}: G = {gini_impurity(left_bad):.4f}")
    print(f"  Right {right_bad}: G = {gini_impurity(right_bad):.4f}")
    print(f"  Weighted Gini: {weighted_gini(left_bad, right_bad):.4f}")
    print(f"  Gini Gain: {gini_gain(parent, left_bad, right_bad):.4f}")
```

We weight by node size because larger nodes contribute more to overall classification error. A split that purifies a node with 1000 samples is more valuable than one that purifies a node with 10 samples.
One of Gini impurity's practical advantages is its computational efficiency. Unlike entropy (which requires logarithms), Gini only needs basic arithmetic operations.
Complexity Analysis:
Computing Gini for a node: $O(K)$ where $K$ is the number of classes
Evaluating all splits for one feature: $O(n \log n)$ to sort the feature values, then an $O(nK)$ incremental scan over candidate thresholds
Finding best split across all features: $O(d \cdot n \log n)$
The Incremental Update Trick:
When scanning sorted values to find the best threshold, we don't recompute Gini from scratch. Instead, we maintain running class counts for the left and right children; moving one sample across the candidate threshold changes exactly two counts, so each split can be re-evaluated in $O(K)$ time.
```python
import numpy as np
from numba import njit


@njit
def gini_from_counts_fast(counts: np.ndarray, total: int) -> float:
    """
    Ultra-fast Gini computation using just additions and multiplications.

    G = 1 - Σ(n_k/n)² = 1 - Σ(n_k²)/n² = (n² - Σn_k²) / n²

    This avoids division until the end.
    """
    if total == 0:
        return 0.0

    sum_sq = 0
    for c in counts:
        sum_sq += c * c

    # G = (n² - Σn_k²) / n²
    n_sq = total * total
    return (n_sq - sum_sq) / n_sq


@njit
def find_best_split_fast(sorted_labels: np.ndarray,
                         sorted_values: np.ndarray,
                         n_classes: int) -> tuple:
    """
    Efficient split finding with O(n) scan after sorting.

    Uses incremental count updates to avoid recomputing Gini from scratch.
    """
    n = len(sorted_labels)

    # Initialize counts
    left_counts = np.zeros(n_classes, dtype=np.int64)
    right_counts = np.zeros(n_classes, dtype=np.int64)
    for label in sorted_labels:
        right_counts[label] += 1

    best_gini = 1e10
    best_idx = -1

    left_total = 0
    right_total = n

    for i in range(n - 1):
        # Move sample i from right to left
        label = sorted_labels[i]
        left_counts[label] += 1
        right_counts[label] -= 1
        left_total += 1
        right_total -= 1

        # Skip if not a valid split point
        if sorted_values[i] >= sorted_values[i + 1]:
            continue

        # Compute weighted Gini
        left_gini = gini_from_counts_fast(left_counts, left_total)
        right_gini = gini_from_counts_fast(right_counts, right_total)
        weighted = (left_total * left_gini + right_total * right_gini) / n

        if weighted < best_gini:
            best_gini = weighted
            best_idx = i

    return best_idx, best_gini


# Comparison: Gini vs log operations (entropy)
def benchmark_operations():
    """
    Compare computational cost of Gini vs entropy calculations.
    """
    import time

    n_trials = 1000000
    probs = np.random.dirichlet(np.ones(10), n_trials)

    # Time Gini: just squares and sums
    start = time.perf_counter()
    gini_vals = 1 - np.sum(probs ** 2, axis=1)
    gini_time = time.perf_counter() - start

    # Time Entropy: requires log
    start = time.perf_counter()
    # Add small epsilon to avoid log(0)
    safe_probs = np.clip(probs, 1e-10, 1)
    entropy_vals = -np.sum(safe_probs * np.log2(safe_probs), axis=1)
    entropy_time = time.perf_counter() - start

    print(f"\nBenchmark: {n_trials:,} 10-class probability vectors")
    print(f"Gini time:    {gini_time:.4f}s")
    print(f"Entropy time: {entropy_time:.4f}s")
    print(f"Ratio: Entropy is {entropy_time/gini_time:.2f}x slower")


if __name__ == "__main__":
    benchmark_operations()
```

Gini's simpler arithmetic (no transcendental functions) also benefits from modern CPU optimizations. SIMD instructions can evaluate Gini across multiple splits simultaneously, making it 2-4x faster than entropy in typical tree implementations.
While Gini impurity is widely used, it's one of several possible impurity measures. Let's briefly compare it to the main alternatives, with detailed treatment reserved for later pages.
The Three Common Impurity Measures:
| Measure | Formula | Max Value (Binary) |
|---|---|---|
| Gini | $1 - \sum p_k^2$ | 0.5 |
| Entropy | $-\sum p_k \log_2 p_k$ | 1.0 |
| Misclassification | $1 - \max_k p_k$ | 0.5 |
Quick Intuition:
All three are zero for pure nodes, maximal at the uniform distribution, symmetric under relabeling of the classes, and concave functions of the class proportions.
However, they have different curvatures and sensitivities, leading to different splitting behavior in some cases.
| Property | Gini | Entropy | Misclassification |
|---|---|---|---|
| Computation | Very fast (squares) | Slower (logarithms) | Very fast (max) |
| Curvature | Moderate | High near boundaries | Piecewise linear |
| Sensitivity | Balanced | More sensitive to small classes | Insensitive to distribution shape |
| Splittability | Can always improve with splits | Can always improve with splits | May get stuck (not strictly concave) |
| Default in CART | Yes | No | No |
| Default in ID3/C4.5 | No | Yes | No |
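To make the curvature differences concrete, here is a minimal sketch (not tied to any particular library) tabulating all three measures for a binary node at several values of $p$; entropy uses base-2 logarithms, so its maximum is 1.0 as in the table above.

```python
import numpy as np


def gini(p: float) -> float:
    return 2 * p * (1 - p)


def entropy(p: float) -> float:
    # H = -Σ q log2 q, with the convention 0 * log2(0) = 0
    terms = [-q * np.log2(q) for q in (p, 1 - p) if q > 0]
    return max(0.0, float(sum(terms)))


def misclassification(p: float) -> float:
    return 1 - max(p, 1 - p)


print(f"{'p':>5} {'Gini':>7} {'Entropy':>9} {'Misclass':>9}")
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"{p:5.2f} {gini(p):7.3f} {entropy(p):9.3f} {misclassification(p):9.3f}")
```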
In practice, Gini and entropy produce very similar trees in most cases (studies show ~2% difference in tree structure). Gini is preferred when computational speed matters; entropy is preferred when class imbalance is extreme or information-theoretic justification is desired.
We've explored Gini impurity comprehensively—from intuition to mathematical rigor to implementation. Let's consolidate the key insights:

- Definition: $G = 1 - \sum_k p_k^2$, ranging from $0$ (pure node) to $1 - 1/K$ (uniform distribution over $K$ classes).
- Interpretation: the probability of misclassifying a random item under random labeling, equivalently the probability that two random draws belong to different classes.
- Splitting: trees choose the split with the largest Gini gain, the parent's impurity minus the size-weighted impurity of the children.
- Efficiency: Gini needs only sums and squares (no logarithms), and incremental count updates make split scanning fast after sorting.
- In practice: Gini and entropy usually produce very similar trees; Gini is the default criterion in CART and scikit-learn.
What's Next:
Now that we've mastered Gini impurity, the next page explores Entropy and Information Gain—the alternative criterion rooted in information theory. We'll see how Claude Shannon's revolutionary ideas from communication theory provide a different but equally valid perspective on node purity.
You now have a deep, multi-faceted understanding of Gini impurity. This foundation will serve you well as we explore entropy and compare these criteria systematically. Remember: Gini's power lies in its simplicity and probabilistic interpretation.