Gini impurity is a fundamental metric used in supervised machine learning to quantify the disorder or heterogeneity within a dataset's class distribution. This measure is central to decision tree algorithms, where it guides the selection of optimal splitting criteria to create the most informative partitions.
The Gini impurity for a node containing samples from multiple classes is calculated using the following formula:
$$\text{Gini Impurity} = 1 - \sum_{k=1}^{K} p_k^2$$
Where:
• $K$ is the number of classes present in the node
• $p_k$ is the proportion of samples belonging to class $k$
Intuition Behind Gini Impurity:
Gini impurity is the probability that a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node's class distribution. It equals 0 for a perfectly pure node (all samples share one class) and reaches its maximum of $1 - 1/K$ when all $K$ classes are equally represented (0.5 for binary classification).
Your Task:
Implement a function that computes the Gini impurity for a given list of class labels. The function should:
• Accept a list of class labels
• Return the Gini impurity of the class distribution as a float
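One possible implementation is sketched below in Python (the function name `gini_impurity` and the empty-node convention are my own choices, not specified by the problem):

```python
from collections import Counter

def gini_impurity(labels):
    """Compute Gini impurity = 1 - sum of p_k^2 over the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty node (assumption, not specified)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())
```

For example, `round(gini_impurity([0, 1, 1, 1, 0]), 2)` gives `0.48`, matching the first worked example.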
Example 1:
Input: labels = [0, 1, 1, 1, 0]
Output: 0.48
Explanation: The dataset contains 5 samples: 2 labeled as class 0 and 3 labeled as class 1.
Step 1: Calculate class probabilities:
• p₀ = 2/5 = 0.4 (probability of class 0)
• p₁ = 3/5 = 0.6 (probability of class 1)

Step 2: Apply the Gini formula:
Gini = 1 - (p₀² + p₁²)
     = 1 - (0.4² + 0.6²)
     = 1 - (0.16 + 0.36)
     = 1 - 0.52
     = 0.48
The impurity of 0.48 indicates a reasonably mixed distribution, close to the maximum possible impurity of 0.5 for binary classification.
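The arithmetic above can be checked directly in Python (a one-off check, separate from the required function):

```python
# Class probabilities for labels = [0, 1, 1, 1, 0]
p0 = 2 / 5  # 0.4, probability of class 0
p1 = 3 / 5  # 0.6, probability of class 1

# Gini = 1 - (p0^2 + p1^2)
gini = 1 - (p0 ** 2 + p1 ** 2)
print(round(gini, 2))  # prints 0.48
```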
Example 2:
Input: labels = [1, 1, 1, 1]
Output: 0.0
Explanation: All 4 samples belong to class 1, representing a perfectly pure node.
Step 1: Calculate class probabilities:
• p₁ = 4/4 = 1.0 (100% of samples are class 1)

Step 2: Apply the Gini formula:
Gini = 1 - p₁²
     = 1 - 1.0²
     = 1 - 1.0
     = 0.0
A Gini impurity of 0.0 indicates perfect classification purity — this node would be a leaf node in a decision tree since no further splitting is beneficial.
Example 3:
Input: labels = [0, 0, 1, 1]
Output: 0.5
Explanation: The dataset is perfectly balanced, with 2 samples each from class 0 and class 1.
Step 1: Calculate class probabilities:
• p₀ = 2/4 = 0.5 (50% are class 0)
• p₁ = 2/4 = 0.5 (50% are class 1)

Step 2: Apply the Gini formula:
Gini = 1 - (p₀² + p₁²)
     = 1 - (0.5² + 0.5²)
     = 1 - (0.25 + 0.25)
     = 1 - 0.5
     = 0.5
This represents the maximum possible Gini impurity for binary classification. A 50-50 split provides no predictive power, as predicting either class would be equally likely to be correct.
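This maximum generalizes beyond the binary case: a uniform distribution over $K$ classes gives Gini impurity $1 - K \cdot (1/K)^2 = 1 - 1/K$. A quick sketch (the helper name `max_gini` is my own):

```python
def max_gini(k):
    """Gini impurity of a uniform distribution over k classes: 1 - k*(1/k)^2 = 1 - 1/k."""
    return 1 - sum((1 / k) ** 2 for _ in range(k))

print(max_gini(2))  # prints 0.5
print(max_gini(4))  # prints 0.75
```

So the larger the number of equally likely classes, the closer the maximum impurity gets to 1.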
Constraints