In machine learning, data partitioning is a foundational operation that enables algorithms to split datasets based on specific criteria. This technique is essential for building decision trees, implementing conditional logic in preprocessing pipelines, and creating data subsets for targeted analysis.
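Given a dataset represented as a 2D array X with shape (n_samples, n_features), a feature index indicating which column to examine, and a threshold value, your task is to partition the dataset into two distinct subsets: the samples whose value for that feature is greater than or equal to the threshold, and the samples whose value is less than the threshold.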
Mathematical Formulation:
For a sample $x_i$ with feature values $[x_{i,0}, x_{i,1}, ..., x_{i,m}]$, and given feature index $j$ and threshold $t$:
$$\text{Subset A} = \{x_i \mid x_{i,j} \geq t\}$$
$$\text{Subset B} = \{x_i \mid x_{i,j} < t\}$$
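The two set definitions translate almost literally into list comprehensions. The following is a minimal sketch in plain Python; the rows, the feature index `j`, and the threshold `t` are illustrative values, not part of the problem statement.

```python
# Hypothetical rows, feature index j, and threshold t (illustrative only).
rows = [[2.0, 1.0], [0.5, 3.0], [4.0, 0.0]]
j, t = 0, 2.0

subset_a = [x for x in rows if x[j] >= t]  # samples with x_{i,j} >= t
subset_b = [x for x in rows if x[j] < t]   # samples with x_{i,j} < t

# For ordinary (non-NaN) numbers, every row lands in exactly one subset.
assert len(subset_a) + len(subset_b) == len(rows)
```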
Important Properties:
Your Task: Write a Python function that partitions a dataset into two subsets based on the threshold condition for the specified feature. Return the two subsets as a list of two NumPy arrays, with the subset meeting the condition (≥ threshold) listed first, followed by the subset not meeting the condition (< threshold).
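One possible approach, sketched here under the assumption that the input is numeric and contains no NaN values in the chosen column, is to build a boolean mask and index with it and its complement. The function name `divide_on_feature` and the sample input are illustrative choices, not something the problem prescribes.

```python
import numpy as np

def divide_on_feature(X, feature_i, threshold):
    """Split the rows of X on one column (the name is illustrative)."""
    X = np.asarray(X)
    # Boolean mask: True for rows whose selected feature is >= threshold.
    meets = X[:, feature_i] >= threshold
    # Boolean indexing keeps the rows in their original order.
    return [X[meets], X[~meets]]

# Hypothetical input for a quick check.
left, right = divide_on_feature([[10, 1], [20, 2], [30, 3]], feature_i=1, threshold=2)
print(left)   # rows with column 1 >= 2
print(right)  # rows with column 1 < 2
```

Using the complement `~meets` instead of a second comparison guarantees that the two subsets together contain every row exactly once.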
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
feature_i = 0
threshold = 5
[array([[5, 6], [7, 8], [9, 10]]), array([[1, 2], [3, 4]])]
The dataset is partitioned based on the first column (feature index 0) with threshold 5.
• Samples with column 0 value ≥ 5: [5, 6], [7, 8], [9, 10] → First subset
• Samples with column 0 value < 5: [1, 2], [3, 4] → Second subset
The samples are grouped while maintaining their original relative order within each subset.
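Order preservation is easier to see on unsorted input. The sketch below reuses the rows of the example above in a shuffled order (the shuffled arrangement is a hypothetical input) and relies on NumPy boolean indexing, which returns the selected rows in the order they appear.

```python
import numpy as np

# Hypothetical unsorted input: the rows of the example above, shuffled.
X = np.array([[9, 10], [1, 2], [7, 8], [3, 4], [5, 6]])
mask = X[:, 0] >= 5

first = X[mask]    # [[9, 10], [7, 8], [5, 6]]: input order, not sorted
second = X[~mask]  # [[1, 2], [3, 4]]
```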
X = np.array([[1, 5], [2, 3], [4, 7], [6, 2], [8, 9]])
feature_i = 1
threshold = 5
[array([[1, 5], [4, 7], [8, 9]]), array([[2, 3], [6, 2]])]
The dataset is partitioned based on the second column (feature index 1) with threshold 5.
• Samples with column 1 value ≥ 5: [1, 5], [4, 7], [8, 9] → First subset
• Samples with column 1 value < 5: [2, 3], [6, 2] → Second subset
Note that even though [6, 2] has a larger value in column 0, it's placed in the second subset because its column 1 value (2) is less than 5.
X = np.array([[-5, 2], [-3, 4], [0, 6], [3, 8], [5, 10]])
feature_i = 0
threshold = 0
[array([[0, 6], [3, 8], [5, 10]]), array([[-5, 2], [-3, 4]])]
The dataset is partitioned based on the first column (feature index 0) with threshold 0.
• Samples with column 0 value ≥ 0: [0, 6], [3, 8], [5, 10] → First subset (includes the boundary value 0)
• Samples with column 0 value < 0: [-5, 2], [-3, 4] → Second subset (negative values only)
This demonstrates handling of negative values and the inclusive nature of the ≥ comparison for the first subset.
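The effect of the inclusive comparison can be checked by contrasting it with a strict one on the same data. This is a small sketch using the example's array; the variable names are illustrative.

```python
import numpy as np

X = np.array([[-5, 2], [-3, 4], [0, 6], [3, 8], [5, 10]])
col = X[:, 0]

inclusive = X[col >= 0]  # keeps the boundary row [0, 6]
strict = X[col > 0]      # drops it, which is not what the task asks for

print(len(inclusive), len(strict))  # 3 2
```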
Constraints