When building machine learning classification models, splitting your dataset into training and testing subsets is a critical preprocessing step. However, a naive random split can lead to class imbalance issues where the proportion of classes in the training set differs significantly from the original dataset—especially problematic with imbalanced datasets where some classes have far fewer samples than others.
Proportional class partitioning (also known as stratified splitting) addresses this challenge by ensuring that each subset maintains the same relative frequency of class labels as the original dataset. This technique is essential for obtaining evaluation results that reflect the true class distribution, particularly when some classes are underrepresented.
Given a dataset with features X and class labels y, along with a desired test proportion:
For a dataset with C distinct classes, where class c contains n_c samples:
$$n_{test,c} = \lfloor n_c \times \text{test\_size} \rfloor$$
$$n_{train,c} = n_c - n_{test,c}$$
The final training set contains Σ n_{train,c} samples and the test set contains Σ n_{test,c} samples, with class proportions preserved in both.
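The two formulas above can be computed directly. A minimal sketch (the function name `class_split_counts` is illustrative, not part of the required interface):

```python
import math

def class_split_counts(n_c, test_size):
    """Return (n_train_c, n_test_c) for a class with n_c samples."""
    n_test = math.floor(n_c * test_size)  # floor(n_c * test_size)
    n_train = n_c - n_test                # the remainder stays in training
    return n_train, n_test
```

For instance, `class_split_counts(3, 0.5)` returns `(2, 1)`: one of the three samples is floored into the test set and the other two remain in training.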
Your Task: Implement a function that performs proportional class partitioning on a given dataset. The function should split the data while maintaining class balance in both resulting subsets, using random shuffling within each class for reproducibility when a seed is provided.
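One possible implementation, sketched in Python. It groups indices by class, shuffles within each class using `random.Random(seed)`, and applies the floor rule per class. The exact element ordering depends on the RNG details and so may differ from the sample outputs shown in the examples, but the per-class counts always match the formulas above:

```python
import math
import random

def stratified_split(X, y, test_size, random_seed=None):
    """Split (X, y) into train/test subsets, preserving class proportions.

    A sketch under the problem's assumptions: within each class, indices
    are shuffled (seeded when random_seed is given) and
    floor(n_c * test_size) samples are assigned to the test set.
    """
    # Group sample indices by class label, preserving first-seen order.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    rng = random.Random(random_seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class for randomness
        n_test = math.floor(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```

Because the shuffle is confined to each class's own indices, no sample can migrate across classes, so the per-class counts (and hence the class proportions) are guaranteed by construction.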
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
y = [0, 0, 0, 1, 1, 1]
test_size = 0.5
random_seed = 42

X_train = [[3.0, 4.0], [5.0, 6.0], [11.0, 12.0], [7.0, 8.0]]
X_test = [[1.0, 2.0], [9.0, 10.0]]
y_train = [0, 0, 1, 1]
y_test = [0, 1]

The dataset has 6 samples with 3 samples each from class 0 and class 1. With test_size=0.5, we calculate floor(3 × 0.5) = 1 sample per class for testing.
For class 0 (indices 0, 1, 2): After shuffling with seed 42, we select 1 sample for testing and 2 for training.
For class 1 (indices 3, 4, 5): Similarly shuffled and split, yielding 1 test sample and 2 training samples.
The final split maintains the 50-50 class ratio: training has 2 samples of each class, and testing has 1 sample of each class. This ensures both sets are representative of the original class distribution.
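The class counts in the example output can be checked directly from the labels given above:

```python
from collections import Counter

# Label lists from Example 1's stated outputs.
y_train = [0, 0, 1, 1]
y_test = [0, 1]

# Both subsets keep the original 50-50 class ratio.
assert Counter(y_train) == {0: 2, 1: 2}
assert Counter(y_test) == {0: 1, 1: 1}
```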
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y = [0, 0, 1, 1]
test_size = 0.25
random_seed = 123

X_train = [[3.0, 4.0], [1.0, 2.0], [5.0, 6.0], [7.0, 8.0]]
X_test = []
y_train = [0, 0, 1, 1]
y_test = []

With only 2 samples per class and test_size=0.25, we calculate floor(2 × 0.25) = floor(0.5) = 0 samples per class for testing. Since the floor operation yields zero, no samples are allocated to the test set from either class.
The entire dataset (after shuffling within classes with the random seed) becomes the training set. This edge case demonstrates how small class sizes combined with low test proportions can result in empty test sets, highlighting the importance of having sufficient samples per class for meaningful splits.
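The arithmetic behind this edge case, along with the smallest class size that produces a non-empty test set at this proportion (a quick check of the floor rule above):

```python
import math

# floor(2 * 0.25) = floor(0.5) = 0: no test samples per class.
assert math.floor(2 * 0.25) == 0

# With test_size = 0.25, a class needs at least 4 samples
# (so that n_c * test_size >= 1) before one sample reaches the test set.
assert math.floor(4 * 0.25) == 1
```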
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [9.0]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
test_size = 0.33
random_seed = 7

X_train = [[3.0], [2.0], [1.0], [4.0], [5.0], [6.0], [9.0], [8.0], [7.0]]
X_test = []
y_train = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_test = []

This example demonstrates multi-class stratification with 3 classes, each having 3 samples. With test_size=0.33, we calculate floor(3 × 0.33) = floor(0.99) = 0 samples per class for testing.
Despite the intuition that ~33% of 3 samples should yield 1 test sample, the floor operation rounds 0.99 down to 0. All samples remain in the training set (shuffled by class with seed=7), and the test set is empty.
This illustrates a key characteristic: the algorithm is conservative and will not allocate a sample to testing unless the product n_c × test_size reaches at least 1 before flooring.
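This conservatism comes from flooring rather than rounding, which can be seen directly (a small check, contrasting the two operations on the same product):

```python
import math

# floor(3 * 0.33) = floor(0.99) = 0: the product never reaches 1.
assert math.floor(3 * 0.33) == 0

# Rounding instead would have allocated one test sample per class.
assert round(3 * 0.33) == 1

# Nudging test_size so that n_c * test_size >= 1 yields one test sample.
assert math.floor(3 * 0.34) == 1
```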
Constraints