When building machine learning classification models, splitting your dataset into training and testing subsets is a critical preprocessing step. However, a naive random split can lead to class imbalance issues where the proportion of classes in the training set differs significantly from the original dataset—especially problematic with imbalanced datasets where some classes have far fewer samples than others.
Proportional class partitioning (also known as stratified splitting) addresses this challenge by ensuring that each subset maintains the same relative frequency of class labels as the original dataset. This technique is essential for obtaining evaluation results that reflect the true class distribution, particularly when some classes are underrepresented.
Given a dataset with features X and class labels y, along with a desired test proportion:
For a dataset with C distinct classes, where class c contains n_c samples:
$$n_{test,c} = \lfloor n_c \times \text{test\_size} \rfloor$$
$$n_{train,c} = n_c - n_{test,c}$$
The final training set contains Σ n_{train,c} samples and the test set contains Σ n_{test,c} samples, with class proportions preserved in both.
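The two formulas above can be computed directly. A minimal sketch (the function name `class_split_counts` is illustrative, not part of the required interface):

```python
import math

def class_split_counts(n_c, test_size):
    """Return (n_train_c, n_test_c) for a class with n_c samples."""
    n_test = math.floor(n_c * test_size)  # floor(n_c * test_size)
    n_train = n_c - n_test                # the remainder stays in training
    return n_train, n_test
```

For instance, `class_split_counts(3, 0.5)` returns `(2, 1)`: one of the three samples is floored into the test set and the other two remain in training.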
Your Task: Implement a function that performs proportional class partitioning on a given dataset. The function should split the data while maintaining class balance in both resulting subsets, using random shuffling within each class for reproducibility when a seed is provided.
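One possible implementation, sketched in Python. It groups indices by class, shuffles within each class using `random.Random(seed)`, and applies the floor rule per class. The exact element ordering depends on the RNG details and so may differ from the sample outputs shown in the examples, but the per-class counts always match the formulas above:

```python
import math
import random

def stratified_split(X, y, test_size, random_seed=None):
    """Split (X, y) into train/test subsets, preserving class proportions.

    A sketch under the problem's assumptions: within each class, indices
    are shuffled (seeded when random_seed is given) and
    floor(n_c * test_size) samples are assigned to the test set.
    """
    # Group sample indices by class label, preserving first-seen order.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    rng = random.Random(random_seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class for randomness
        n_test = math.floor(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```

Because the shuffle is confined to each class's own indices, no sample can migrate across classes, so the per-class counts (and hence the class proportions) are guaranteed by construction.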
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
y = [0, 0, 0, 1, 1, 1]
test_size = 0.5
random_seed = 42

X_train = [[3.0, 4.0], [5.0, 6.0], [11.0, 12.0], [7.0, 8.0]]
X_test = [[1.0, 2.0], [9.0, 10.0]]
y_train = [0, 0, 1, 1]
y_test = [0, 1]

The dataset has 6 samples with 3 samples each from class 0 and class 1. With test_size=0.5, we calculate floor(3 × 0.5) = 1 sample per class for testing.
For class 0 (indices 0, 1, 2): After shuffling with seed 42, we select 1 sample for testing and 2 for training.
For class 1 (indices 3, 4, 5): Similarly shuffled and split, yielding 1 test sample and 2 training samples.
The final split maintains the 50-50 class ratio: training has 2 samples of each class, and testing has 1 sample of each class. This ensures both sets are representative of the original class distribution.
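The class counts in the example output can be checked directly from the labels given above:

```python
from collections import Counter

# Label lists from Example 1's stated outputs.
y_train = [0, 0, 1, 1]
y_test = [0, 1]

# Both subsets keep the original 50-50 class ratio.
assert Counter(y_train) == {0: 2, 1: 2}
assert Counter(y_test) == {0: 1, 1: 1}
```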
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y = [0, 0, 1, 1]
test_size = 0.25
random_seed = 123

X_train = [[3.0, 4.0], [1.0, 2.0], [5.0, 6.0], [7.0, 8.0]]
X_test = []
y_train = [0, 0, 1, 1]
y_test = []

With only 2 samples per class and test_size=0.25, we calculate floor(2 × 0.25) = floor(0.5) = 0 samples per class for testing. Since the floor operation yields zero, no samples are allocated to the test set from either class.
The entire dataset (after shuffling within classes with the random seed) becomes the training set. This edge case demonstrates how small class sizes combined with low test proportions can result in empty test sets, highlighting the importance of having sufficient samples per class for meaningful splits.
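The arithmetic behind this edge case, along with the smallest class size that produces a non-empty test set at this proportion (a quick check of the floor rule above):

```python
import math

# floor(2 * 0.25) = floor(0.5) = 0: no test samples per class.
assert math.floor(2 * 0.25) == 0

# With test_size = 0.25, a class needs at least 4 samples
# (so that n_c * test_size >= 1) before one sample reaches the test set.
assert math.floor(4 * 0.25) == 1
```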
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [9.0]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
test_size = 0.33
random_seed = 7

X_train = [[3.0], [2.0], [1.0], [4.0], [5.0], [6.0], [9.0], [8.0], [7.0]]
X_test = []
y_train = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_test = []

This example demonstrates multi-class stratification with 3 classes, each having 3 samples. With test_size=0.33, we calculate floor(3 × 0.33) = floor(0.99) = 0 samples per class for testing.
Despite the intuition that ~33% of 3 samples should yield 1 test sample, the floor operation rounds 0.99 down to 0. All samples remain in the training set (shuffled by class with seed=7), and the test set is empty.
This illustrates a key characteristic: the algorithm is conservative and will not allocate a sample to testing unless the product n_c × test_size reaches at least 1 before flooring.
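This conservatism comes from flooring rather than rounding, which can be seen directly (a small check, contrasting the two operations on the same product):

```python
import math

# floor(3 * 0.33) = floor(0.99) = 0: the product never reaches 1.
assert math.floor(3 * 0.33) == 0

# Rounding instead would have allocated one test sample per class.
assert round(3 * 0.33) == 1

# Nudging test_size so that n_c * test_size >= 1 yields one test sample.
assert math.floor(3 * 0.34) == 1
```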
Constraints