At the core of every decision tree lies a deceptively simple question: How do we decide which feature to split on? The answer requires a formal measure of 'purity'—a way to quantify how mixed or homogeneous a set of examples is with respect to their class labels.
Gini impurity is one of the most widely used such measures. Named after the Italian statistician Corrado Gini (who introduced the Gini coefficient for income inequality in 1912), Gini impurity has become the default splitting criterion in CART (Classification and Regression Trees) and many of the most popular decision tree implementations, including scikit-learn.
But Gini impurity isn't just a convenient formula—it embodies deep principles about classification, probability, and optimal decision-making. Understanding it thoroughly will not only help you build better trees but also illuminate fundamental concepts in machine learning.
By the end of this page, you will understand Gini impurity from multiple perspectives: its intuitive meaning, mathematical definition, geometric interpretation, computational properties, and behavior in practice. You'll be able to derive it, compute it efficiently, and reason about when and why it works well.
Before diving into formulas, let's build intuition. Imagine you have a basket of fruits and you want to characterize how 'mixed' it is.
Scenario 1: The basket contains 100 apples and 0 oranges.
Scenario 2: The basket contains 50 apples and 50 oranges.
Scenario 3: The basket contains 80 apples and 20 oranges.
We need a function that returns its minimum (zero) for Scenario 1's perfectly pure basket, its maximum for Scenario 2's evenly mixed basket, and an intermediate value for the partially mixed basket of Scenario 3.
Gini impurity is precisely such a function.
Here's a powerful intuition: Gini impurity measures the probability that a randomly chosen element would be incorrectly classified if we assigned labels based on the class distribution. In other words, it quantifies expected classification error under random labeling.
Let's formalize Gini impurity mathematically. Consider a node in a decision tree containing $n$ examples belonging to $K$ classes. Let $p_k$ denote the proportion of examples belonging to class $k$:
$$p_k = \frac{n_k}{n}$$
where $n_k$ is the number of examples of class $k$ and $\sum_{k=1}^{K} p_k = 1$.
The Gini impurity is defined as:
$$G = \sum_{k=1}^{K} p_k (1 - p_k) = \sum_{k=1}^{K} p_k - \sum_{k=1}^{K} p_k^2 = 1 - \sum_{k=1}^{K} p_k^2$$
The three equivalent forms above reveal different aspects of Gini impurity:
Form 1: $\sum_{k=1}^{K} p_k (1 - p_k)$
Form 2: $1 - \sum_{k=1}^{K} p_k^2$
Form 3 (for binary classification): $G = 2p(1-p) = 1 - p^2 - (1-p)^2$, where $p$ is the proportion of one class.
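Applying these forms to the fruit baskets from the opening scenarios gives:

$$G_{100/0} = 1 - (1^2 + 0^2) = 0, \qquad G_{50/50} = 1 - (0.5^2 + 0.5^2) = 0.5, \qquad G_{80/20} = 1 - (0.8^2 + 0.2^2) = 0.32$$

The pure basket scores zero, the evenly mixed basket reaches the binary maximum of 0.5, and the 80/20 basket lands in between, exactly the ordering our intuition demanded.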
The term $\sum_{k=1}^{K} p_k^2$ is the collision probability (known in ecology as Simpson's index). It measures the chance that two randomly selected items belong to the same class. Gini impurity equals one minus the collision probability, so it measures heterogeneity.
```python
import numpy as np
from typing import Union


def gini_impurity(class_counts: Union[list, np.ndarray]) -> float:
    """
    Compute Gini impurity from class counts.

    The Gini impurity measures the probability that a randomly chosen
    element would be incorrectly classified under random labeling based
    on the class distribution.

    Args:
        class_counts: Array or list of counts for each class

    Returns:
        Gini impurity value in range [0, 1 - 1/K]

    Mathematical Definition:
        G = 1 - Σ(p_k²) where p_k = n_k / n

    Properties:
        - G = 0 for perfectly pure nodes (all samples same class)
        - G = 1 - 1/K for uniform distribution over K classes
        - For binary: maximum is 0.5 when p = 0.5

    Example:
        >>> gini_impurity([50, 50])      # Binary, balanced
        0.5
        >>> gini_impurity([100, 0])      # Pure node
        0.0
        >>> gini_impurity([30, 30, 30])  # 3-class, uniform
        0.6666666666666667
    """
    counts = np.array(class_counts, dtype=np.float64)
    n = counts.sum()

    if n == 0:
        return 0.0  # Empty node has no impurity

    # Compute class probabilities
    p = counts / n

    # Gini = 1 - Σ(p_k²)
    return 1.0 - np.sum(p ** 2)


def gini_impurity_binary(p: float) -> float:
    """
    Compute Gini impurity for binary classification.

    For two classes with proportions p and (1-p):
        G = 2p(1-p) = 1 - p² - (1-p)²

    Args:
        p: Proportion of positive class (0 ≤ p ≤ 1)

    Returns:
        Gini impurity value in range [0, 0.5]

    Note:
        Maximum of 0.5 occurs at p = 0.5 (most uncertain)
        Minimum of 0.0 occurs at p = 0 or p = 1 (pure)
    """
    return 2 * p * (1 - p)


# Demonstration
if __name__ == "__main__":
    print("Gini Impurity Examples:")
    print("=" * 50)

    # Pure node
    print(f"Pure [100, 0]: G = {gini_impurity([100, 0]):.4f}")

    # Balanced binary
    print(f"Balanced [50, 50]: G = {gini_impurity([50, 50]):.4f}")

    # Imbalanced
    print(f"[80, 20]: G = {gini_impurity([80, 20]):.4f}")
    print(f"[90, 10]: G = {gini_impurity([90, 10]):.4f}")
    print(f"[99, 1]: G = {gini_impurity([99, 1]):.4f}")

    # Multi-class
    print(f"\n3-class uniform [33, 33, 34]: G = {gini_impurity([33, 33, 34]):.4f}")
    print(f"3-class max = 1 - 1/3 = {1 - 1/3:.4f}")

    # Verify binary formula
    print(f"\nBinary formula verification:")
    for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
        print(f"  p={p:.1f}: G = {gini_impurity_binary(p):.4f}")
```

Understanding the mathematical properties of Gini impurity is essential for reasoning about decision tree behavior. Let's examine these rigorously.
| Classes (K) | Minimum Gini | Maximum Gini | At Maximum Distribution |
|---|---|---|---|
| 2 | 0.0 | 0.5 | [0.5, 0.5] |
| 3 | 0.0 | 0.667 | [0.333, 0.333, 0.333] |
| 4 | 0.0 | 0.75 | [0.25, 0.25, 0.25, 0.25] |
| 5 | 0.0 | 0.8 | [0.2, 0.2, 0.2, 0.2, 0.2] |
| K | 0.0 | 1 - 1/K | [1/K, 1/K, ..., 1/K] |
For many classes, the maximum Gini approaches 1. This means multi-class problems have a larger impurity 'range' to work with. However, achieving meaningful reductions becomes harder because the probability mass is spread thin.
Proof of Maximum:
We want to maximize $G = 1 - \sum_{k=1}^K p_k^2$ subject to $\sum_{k=1}^K p_k = 1$.
This is equivalent to minimizing $\sum p_k^2$ subject to the constraint. Using Lagrange multipliers:
$$\mathcal{L} = \sum p_k^2 + \lambda(\sum p_k - 1)$$
Taking derivatives: $\frac{\partial \mathcal{L}}{\partial p_k} = 2p_k + \lambda = 0$
This gives $p_k = -\lambda/2$ for all $k$. Since probabilities must be equal and sum to 1:
$$p_k = \frac{1}{K} \text{ for all } k$$
Substituting: $G_{max} = 1 - K \cdot (1/K)^2 = 1 - 1/K$ ∎
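A quick numeric spot-check of this maximum, written as a minimal sketch independent of the `gini_impurity` helper defined earlier:

```python
import numpy as np

# Check that the uniform distribution attains G = 1 - 1/K for several K
for K in range(2, 7):
    p = np.full(K, 1.0 / K)          # uniform distribution over K classes
    gini = 1.0 - np.sum(p ** 2)      # G = 1 - Σ p_k²
    print(f"K={K}: Gini(uniform) = {gini:.4f}   1 - 1/K = {1 - 1/K:.4f}")
```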
Gini impurity has beautiful geometric interpretations that provide additional insight.
Binary Classification Geometry:
For two classes with probabilities $p$ and $(1-p)$: $$G = 2p(1-p)$$
This is a downward-opening parabola with roots at $p = 0$ and $p = 1$ (the pure nodes) and a vertex at $p = 0.5$, where the impurity reaches its maximum of $0.5$.
The parabola $y = 2p(1-p)$ is symmetric about $p = 0.5$, reflecting that swapping class labels doesn't change impurity.
Simplex Geometry (Multi-class):
For $K$ classes, the probability distributions live on the $(K-1)$-dimensional probability simplex: $$\Delta^{K-1} = \{(p_1, \ldots, p_K) : p_k \geq 0, \; \textstyle\sum_k p_k = 1\}$$
Gini impurity can be rewritten as: $$G = 1 - ||\mathbf{p}||^2$$
where $||\mathbf{p}||^2 = \sum p_k^2$ is the squared Euclidean norm.
This means Gini impurity shrinks as the probability vector's norm grows: it is zero at the simplex vertices (where a single $p_k = 1$, so $||\mathbf{p}|| = 1$) and maximal at the centroid $p_k = 1/K$, where the norm is smallest.
Geometrically, Gini impurity measures how far the probability vector is from the vertices (corners) of the simplex. The corners represent pure distributions, and Gini increases as you move toward the center.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


def visualize_gini_binary():
    """
    Visualize Gini impurity for binary classification.

    Shows the characteristic parabola G = 2p(1-p).
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Gini parabola
    p = np.linspace(0, 1, 100)
    gini = 2 * p * (1 - p)

    ax1 = axes[0]
    ax1.plot(p, gini, 'b-', linewidth=2.5, label='Gini = 2p(1-p)')
    ax1.fill_between(p, gini, alpha=0.2)

    # Mark key points
    ax1.scatter([0, 0.5, 1], [0, 0.5, 0], c='red', s=100, zorder=5)
    ax1.annotate('Pure\n(class 0)', (0, 0), xytext=(0.05, 0.1),
                 fontsize=10, ha='left')
    ax1.annotate('Maximum\nImpurity', (0.5, 0.5), xytext=(0.52, 0.45),
                 fontsize=10, ha='left')
    ax1.annotate('Pure\n(class 1)', (1, 0), xytext=(0.85, 0.1),
                 fontsize=10, ha='left')

    ax1.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax1.set_ylabel('Gini Impurity', fontsize=12)
    ax1.set_title('Binary Gini Impurity: A Downward Parabola', fontsize=14)
    ax1.set_xlim(-0.05, 1.05)
    ax1.set_ylim(-0.05, 0.6)
    ax1.grid(True, alpha=0.3)
    ax1.legend(loc='upper right')

    # Right: Gradient showing rate of change
    ax2 = axes[1]
    gradient = 2 * (1 - 2*p)  # d/dp of 2p(1-p)
    ax2.plot(p, gradient, 'g-', linewidth=2.5, label="dG/dp = 2(1-2p)")
    ax2.axhline(y=0, color='black', linestyle='--', alpha=0.5)
    ax2.axvline(x=0.5, color='red', linestyle='--', alpha=0.5, label='p=0.5')
    ax2.set_xlabel('Proportion of Class 1 (p)', fontsize=12)
    ax2.set_ylabel('Rate of Change of Gini', fontsize=12)
    ax2.set_title('Gini Gradient: Steepest Near Purity', fontsize=14)
    ax2.grid(True, alpha=0.3)
    ax2.legend()

    plt.tight_layout()
    plt.savefig('gini_binary_geometry.png', dpi=150, bbox_inches='tight')
    plt.show()


def visualize_gini_ternary():
    """
    Visualize Gini impurity on the 3-class probability simplex.

    Creates a triangular heatmap with Gini contours.
    """
    # Generate points in the 2-simplex (triangle)
    resolution = 100
    points = []
    gini_values = []

    for i in range(resolution + 1):
        for j in range(resolution + 1 - i):
            k = resolution - i - j
            p1, p2, p3 = i/resolution, j/resolution, k/resolution
            points.append([p1, p2, p3])
            # Gini = 1 - sum(p^2)
            gini = 1 - (p1**2 + p2**2 + p3**2)
            gini_values.append(gini)

    points = np.array(points)
    gini_values = np.array(gini_values)

    # Transform to 2D coordinates for the triangle
    # Use barycentric to Cartesian transformation
    x = points[:, 1] + 0.5 * points[:, 2]
    y = (np.sqrt(3) / 2) * points[:, 2]

    fig, ax = plt.subplots(figsize=(10, 9))

    # Create triangular scatter plot with Gini coloring
    scatter = ax.scatter(x, y, c=gini_values, cmap='viridis', s=20, alpha=0.8)

    # Draw simplex boundary
    triangle = plt.Polygon([[0, 0], [1, 0], [0.5, np.sqrt(3)/2]],
                           fill=False, edgecolor='black', linewidth=2)
    ax.add_patch(triangle)

    # Label vertices
    ax.annotate('Class 1\n(Pure, G=0)', (0, 0), xytext=(-0.15, -0.08),
                fontsize=11, ha='center', fontweight='bold')
    ax.annotate('Class 2\n(Pure, G=0)', (1, 0), xytext=(1.15, -0.08),
                fontsize=11, ha='center', fontweight='bold')
    ax.annotate('Class 3\n(Pure, G=0)', (0.5, np.sqrt(3)/2),
                xytext=(0.5, np.sqrt(3)/2 + 0.1),
                fontsize=11, ha='center', fontweight='bold')

    # Mark center (max impurity)
    ax.scatter([0.5], [np.sqrt(3)/6], c='red', s=200, marker='*',
               edgecolors='white', linewidths=2, zorder=5)
    ax.annotate('Center\n(G = 0.667)', (0.5, np.sqrt(3)/6),
                xytext=(0.65, 0.35), fontsize=10, ha='left',
                arrowprops=dict(arrowstyle='->', color='red'))

    cbar = plt.colorbar(scatter, ax=ax, shrink=0.6)
    cbar.set_label('Gini Impurity', fontsize=12)

    ax.set_xlim(-0.2, 1.2)
    ax.set_ylim(-0.2, 1.1)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title('Gini Impurity on the 3-Class Probability Simplex',
                 fontsize=14, pad=20)

    plt.savefig('gini_simplex.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    visualize_gini_binary()
    visualize_gini_ternary()
```

The most profound interpretation of Gini impurity is probabilistic. It represents the expected error rate under random labeling.
The Random Labeling Experiment: pick an item uniformly at random from the node, then assign it a label drawn at random from the node's class distribution $(p_1, \ldots, p_K)$. The probability of assigning the wrong label is:
$$P(\text{error}) = \sum_{k=1}^K p_k \cdot P(\text{wrong label} | \text{class } k) = \sum_{k=1}^K p_k \cdot (1 - p_k) = G$$
This is exactly Gini impurity!
Alternative Derivation (Two-Sample Collision):
Draw two samples independently with replacement from the node. They land in the same class with probability $\sum_{k} p_k^2$ and in different classes with probability $1 - \sum_{k} p_k^2 = G$.
Gini impurity is the probability that two randomly chosen items belong to different classes.
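A small Monte Carlo sketch can confirm this (the helper name `simulate_mismatch_rate` is ours, purely for illustration): drawing pairs of samples with replacement and counting how often they disagree should approximate $1 - \sum_k p_k^2$.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_mismatch_rate(p: np.ndarray, n_draws: int = 200_000) -> float:
    """Estimate the probability that two independent draws from the
    class distribution p fall in different classes."""
    a = rng.choice(len(p), size=n_draws, p=p)
    b = rng.choice(len(p), size=n_draws, p=p)
    return float(np.mean(a != b))


for dist in ([0.5, 0.5], [0.8, 0.2], [1/3, 1/3, 1/3]):
    p = np.asarray(dist)
    exact = 1.0 - np.sum(p ** 2)  # Gini impurity
    print(f"p={dist}: simulated={simulate_mismatch_rate(p):.4f}  Gini={exact:.4f}")
```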
In ecology, $1 - \sum_k p_k^2$ is known as the Gini-Simpson diversity index, a measure of species diversity. In economics, $\sum_k p_k^2$ appears as the Herfindahl-Hirschman Index of market concentration. Gini impurity is thus one instance of a widely used measure of heterogeneity.
Why This Matters for Decision Trees:
The expected misclassification interpretation explains why minimizing Gini impurity leads to good classification:
Pure nodes have zero expected error — If all items are the same class, random labeling based on the distribution assigns the correct label with probability 1.
Splits that reduce Gini reduce expected error — By separating classes into different branches, we reduce the 'confusion' in each child node.
Weighted average matters — When evaluating a split, we compute the size-weighted Gini of the children, which is the expected random-labeling error for a sample routed through that split.
This probabilistic foundation gives Gini impurity a direct connection to classification performance, making it a principled choice for tree construction.
In decision trees, we don't just compute Gini impurity—we use it to evaluate potential splits. The Gini gain (or reduction in Gini impurity) measures how much a split improves node purity.
Gini Gain Definition:
For a parent node $S$ with Gini impurity $G(S)$, consider a binary split into children $S_L$ and $S_R$ with sizes $n_L$ and $n_R$:
$$\text{Gini Gain} = G(S) - \left( \frac{n_L}{n} G(S_L) + \frac{n_R}{n} G(S_R) \right)$$
The term in parentheses is the weighted average Gini of the children. Gini gain is always non-negative (by concavity of Gini), and larger gain means a better split.
The Splitting Algorithm: for each feature, sort the samples by feature value, sweep the candidate thresholds while updating class counts, and keep the split with the largest Gini gain (equivalently, the smallest weighted child Gini). The implementation below walks through this procedure.
```python
import numpy as np
from typing import Tuple, List


def gini_impurity(class_counts: np.ndarray) -> float:
    """Compute Gini impurity from class counts."""
    n = class_counts.sum()
    if n == 0:
        return 0.0
    p = class_counts / n
    return 1.0 - np.sum(p ** 2)


def weighted_gini(left_counts: np.ndarray, right_counts: np.ndarray) -> float:
    """
    Compute weighted average Gini impurity for a split.

    This is what we minimize when selecting splits:
        weighted_G = (n_L/n) * G(left) + (n_R/n) * G(right)
    """
    n_left = left_counts.sum()
    n_right = right_counts.sum()
    n_total = n_left + n_right

    if n_total == 0:
        return 0.0

    weight_left = n_left / n_total
    weight_right = n_right / n_total

    return weight_left * gini_impurity(left_counts) + \
           weight_right * gini_impurity(right_counts)


def gini_gain(parent_counts: np.ndarray,
              left_counts: np.ndarray,
              right_counts: np.ndarray) -> float:
    """
    Compute Gini gain for a split.

    Gini Gain = G(parent) - weighted_G(children)

    Larger values indicate better splits that more effectively
    separate classes.
    """
    parent_gini = gini_impurity(parent_counts)
    children_weighted_gini = weighted_gini(left_counts, right_counts)
    return parent_gini - children_weighted_gini


def find_best_split(X: np.ndarray, y: np.ndarray,
                    feature_idx: int) -> Tuple[float, float, float]:
    """
    Find the best split threshold for a single feature.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Labels (n_samples,)
        feature_idx: Index of feature to split on

    Returns:
        (best_threshold, best_gini_gain, weighted_gini_after_split)
    """
    n_samples = len(y)
    n_classes = len(np.unique(y))

    # Get feature values and sort
    feature_values = X[:, feature_idx]
    sorted_indices = np.argsort(feature_values)
    sorted_features = feature_values[sorted_indices]
    sorted_labels = y[sorted_indices]

    # Parent statistics
    parent_counts = np.bincount(y, minlength=n_classes)
    parent_gini = gini_impurity(parent_counts)

    best_gain = -np.inf
    best_threshold = None
    best_weighted_gini = None

    # Initialize left as empty, right as all
    left_counts = np.zeros(n_classes, dtype=np.int64)
    right_counts = parent_counts.copy()

    # Scan through all possible split points
    for i in range(n_samples - 1):
        # Move sample i from right to left
        c = sorted_labels[i]
        left_counts[c] += 1
        right_counts[c] -= 1

        # Skip if no actual threshold here (same value)
        if sorted_features[i] == sorted_features[i + 1]:
            continue

        # Compute gain for this split
        w_gini = weighted_gini(left_counts, right_counts)
        gain = parent_gini - w_gini

        if gain > best_gain:
            best_gain = gain
            best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2
            best_weighted_gini = w_gini

    return best_threshold, best_gain, best_weighted_gini


# Demonstration
if __name__ == "__main__":
    print("Gini Split Demonstration")
    print("=" * 60)

    # Example: Parent node with two classes
    parent = np.array([50, 50])  # Balanced binary
    print(f"\nParent: {parent}")
    print(f"Parent Gini: {gini_impurity(parent):.4f}")

    # Perfect split
    left_perfect = np.array([50, 0])
    right_perfect = np.array([0, 50])
    print(f"\nPerfect split:")
    print(f"  Left {left_perfect}: G = {gini_impurity(left_perfect):.4f}")
    print(f"  Right {right_perfect}: G = {gini_impurity(right_perfect):.4f}")
    print(f"  Weighted Gini: {weighted_gini(left_perfect, right_perfect):.4f}")
    print(f"  Gini Gain: {gini_gain(parent, left_perfect, right_perfect):.4f}")

    # Imperfect split
    left_imperfect = np.array([40, 10])
    right_imperfect = np.array([10, 40])
    print(f"\nImperfect split:")
    print(f"  Left {left_imperfect}: G = {gini_impurity(left_imperfect):.4f}")
    print(f"  Right {right_imperfect}: G = {gini_impurity(right_imperfect):.4f}")
    print(f"  Weighted Gini: {weighted_gini(left_imperfect, right_imperfect):.4f}")
    print(f"  Gini Gain: {gini_gain(parent, left_imperfect, right_imperfect):.4f}")

    # Uninformative split (children mirror the parent distribution)
    left_bad = np.array([25, 25])
    right_bad = np.array([25, 25])
    print(f"\nNo-change split:")
    print(f"  Left {left_bad}: G = {gini_impurity(left_bad):.4f}")
    print(f"  Right {right_bad}: G = {gini_impurity(right_bad):.4f}")
    print(f"  Weighted Gini: {weighted_gini(left_bad, right_bad):.4f}")
    print(f"  Gini Gain: {gini_gain(parent, left_bad, right_bad):.4f}")
```

We weight by node size because larger nodes contribute more to overall classification error. A split that purifies a node with 1000 samples is more valuable than one that purifies a node with 10 samples.
One of Gini impurity's practical advantages is its computational efficiency. Unlike entropy (which requires logarithms), Gini only needs basic arithmetic operations.
Complexity Analysis:
Computing Gini for a node: $O(K)$ where $K$ is the number of classes
Evaluating all splits for one feature: $O(n \log n)$ to sort the feature values, then an $O(nK)$ incremental scan over candidate thresholds
Finding best split across all features: $O(d \cdot n \log n)$
The Incremental Update Trick:
When scanning sorted values to find the best threshold, we don't recompute Gini from scratch. Instead, we maintain running class counts for the left and right children; moving one sample across the candidate threshold changes exactly two counts, so each split can be re-evaluated in $O(K)$ time.
```python
import numpy as np
from numba import njit


@njit
def gini_from_counts_fast(counts: np.ndarray, total: int) -> float:
    """
    Ultra-fast Gini computation using just additions and multiplications.

    G = 1 - Σ(n_k/n)² = 1 - Σ(n_k²)/n² = (n² - Σn_k²) / n²

    This avoids division until the end.
    """
    if total == 0:
        return 0.0

    sum_sq = 0
    for c in counts:
        sum_sq += c * c

    # G = (n² - Σn_k²) / n²
    n_sq = total * total
    return (n_sq - sum_sq) / n_sq


@njit
def find_best_split_fast(sorted_labels: np.ndarray,
                         sorted_values: np.ndarray,
                         n_classes: int) -> tuple:
    """
    Efficient split finding with O(n) scan after sorting.

    Uses incremental count updates to avoid recomputing Gini from scratch.
    """
    n = len(sorted_labels)

    # Initialize counts
    left_counts = np.zeros(n_classes, dtype=np.int64)
    right_counts = np.zeros(n_classes, dtype=np.int64)
    for label in sorted_labels:
        right_counts[label] += 1

    best_gini = 1e10
    best_idx = -1

    left_total = 0
    right_total = n

    for i in range(n - 1):
        # Move sample i from right to left
        label = sorted_labels[i]
        left_counts[label] += 1
        right_counts[label] -= 1
        left_total += 1
        right_total -= 1

        # Skip if not a valid split point
        if sorted_values[i] >= sorted_values[i + 1]:
            continue

        # Compute weighted Gini
        left_gini = gini_from_counts_fast(left_counts, left_total)
        right_gini = gini_from_counts_fast(right_counts, right_total)
        weighted = (left_total * left_gini + right_total * right_gini) / n

        if weighted < best_gini:
            best_gini = weighted
            best_idx = i

    return best_idx, best_gini


# Comparison: Gini vs log operations (entropy)
def benchmark_operations():
    """
    Compare computational cost of Gini vs entropy calculations.
    """
    import time

    n_trials = 1000000
    probs = np.random.dirichlet(np.ones(10), n_trials)

    # Time Gini: just squares and sums
    start = time.perf_counter()
    gini_vals = 1 - np.sum(probs ** 2, axis=1)
    gini_time = time.perf_counter() - start

    # Time Entropy: requires log
    start = time.perf_counter()
    # Add small epsilon to avoid log(0)
    safe_probs = np.clip(probs, 1e-10, 1)
    entropy_vals = -np.sum(safe_probs * np.log2(safe_probs), axis=1)
    entropy_time = time.perf_counter() - start

    print(f"\nBenchmark: {n_trials:,} 10-class probability vectors")
    print(f"Gini time:    {gini_time:.4f}s")
    print(f"Entropy time: {entropy_time:.4f}s")
    print(f"Ratio: Entropy is {entropy_time/gini_time:.2f}x slower")


if __name__ == "__main__":
    benchmark_operations()
```

Gini's simpler arithmetic (no transcendental functions) also benefits from modern CPU optimizations. SIMD instructions can evaluate Gini across multiple splits simultaneously, making it 2-4x faster than entropy in typical tree implementations.
While Gini impurity is widely used, it's one of several possible impurity measures. Let's briefly compare it to the main alternatives, with detailed treatment reserved for later pages.
The Three Common Impurity Measures:
| Measure | Formula | Max Value (Binary) |
|---|---|---|
| Gini | $1 - \sum p_k^2$ | 0.5 |
| Entropy | $-\sum p_k \log_2 p_k$ | 1.0 |
| Misclassification | $1 - \max_k p_k$ | 0.5 |
Quick Intuition:
All three are zero for pure nodes, maximal at the uniform distribution, symmetric under relabeling of the classes, and concave functions of the class proportions.
However, they have different curvatures and sensitivities, leading to different splitting behavior in some cases.
| Property | Gini | Entropy | Misclassification |
|---|---|---|---|
| Computation | Very fast (squares) | Slower (logarithms) | Very fast (max) |
| Curvature | Moderate | High near boundaries | Piecewise linear |
| Sensitivity | Balanced | More sensitive to small classes | Insensitive to distribution shape |
| Splittability | Can always improve with splits | Can always improve with splits | May get stuck (not strictly concave) |
| Default in CART | Yes | No | No |
| Default in ID3/C4.5 | No | Yes | No |
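To make the curvature differences concrete, here is a minimal sketch (not tied to any particular library) tabulating all three measures for a binary node at several values of $p$; entropy uses base-2 logarithms, so its maximum is 1.0 as in the table above.

```python
import numpy as np


def gini(p: float) -> float:
    return 2 * p * (1 - p)


def entropy(p: float) -> float:
    # H = -Σ q log2 q, with the convention 0 * log2(0) = 0
    terms = [-q * np.log2(q) for q in (p, 1 - p) if q > 0]
    return max(0.0, float(sum(terms)))


def misclassification(p: float) -> float:
    return 1 - max(p, 1 - p)


print(f"{'p':>5} {'Gini':>7} {'Entropy':>9} {'Misclass':>9}")
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"{p:5.2f} {gini(p):7.3f} {entropy(p):9.3f} {misclassification(p):9.3f}")
```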
In practice, Gini and entropy produce very similar trees in most cases (studies show ~2% difference in tree structure). Gini is preferred when computational speed matters; entropy is preferred when class imbalance is extreme or information-theoretic justification is desired.
We've explored Gini impurity comprehensively—from intuition to mathematical rigor to implementation. Let's consolidate the key insights:

- Definition: $G = 1 - \sum_k p_k^2$, ranging from $0$ (pure node) to $1 - 1/K$ (uniform distribution over $K$ classes).
- Interpretation: the probability of misclassifying a random item under random labeling, equivalently the probability that two random draws belong to different classes.
- Splitting: trees choose the split with the largest Gini gain, the parent's impurity minus the size-weighted impurity of the children.
- Efficiency: Gini needs only sums and squares (no logarithms), and incremental count updates make split scanning fast after sorting.
- In practice: Gini and entropy usually produce very similar trees; Gini is the default criterion in CART and scikit-learn.
What's Next:
Now that we've mastered Gini impurity, the next page explores Entropy and Information Gain—the alternative criterion rooted in information theory. We'll see how Claude Shannon's revolutionary ideas from communication theory provide a different but equally valid perspective on node purity.
You now have a deep, multi-faceted understanding of Gini impurity. This foundation will serve you well as we explore entropy and compare these criteria systematically. Remember: Gini's power lies in its simplicity and probabilistic interpretation.