Random Dataset Subset Generation for Ensemble Learning (Medium)

In machine learning, ensemble methods such as Random Forests and Bagging rely on training multiple models on different subsets of the original dataset. This technique, known as bootstrap sampling or dataset subsetting, is fundamental to reducing model variance and improving generalization.

Given a feature matrix X of dimensions n × m (where n is the number of samples and m is the number of features) and a corresponding target array y of length n, your task is to generate multiple random subsets of this dataset.

Sampling Strategies:

With Replacement (Bootstrap Sampling): When replacements = True, each subset contains n samples drawn randomly with replacement. This means the same sample can appear multiple times within a single subset—a cornerstone technique in bootstrap aggregating (bagging).
Without Replacement: When replacements = False, each subset contains n // 2 samples (integer division) drawn randomly without replacement. Each sample appears at most once within a subset.
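The two strategies differ only in the subset size and in whether an index may repeat. A minimal sketch of the index-drawing step, assuming NumPy is available; the helper name `draw_indices` and the seed value are hypothetical and chosen only for illustration:

```python
import numpy as np

def draw_indices(n, replacements, rng):
    """Draw the sample indices for one subset (hypothetical helper)."""
    # With replacement: n draws, repeats allowed.
    # Without replacement: n // 2 draws, all indices distinct.
    size = n if replacements else n // 2
    return rng.choice(n, size=size, replace=replacements)

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
boot = draw_indices(6, True, rng)   # 6 indices, possibly with repeats
half = draw_indices(6, False, rng)  # 3 distinct indices
```

The same index array can then be used to select rows from both X and y, which keeps features and labels aligned.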

Mathematical Foundation:

For a dataset with n samples:

Bootstrap sampling (with replacement) creates subsets of size n, where each sample has probability 1/n of being selected at each draw.
Sampling without replacement creates subsets of size ⌊n/2⌋ made up of unique indices.

The expected number of unique samples in a bootstrap sample is approximately n × (1 - 1/e) ≈ 0.632n, leaving about 36.8% of samples as "out-of-bag" samples that can be used for validation.
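The 0.632 figure is a large-n limit. A given sample is missed by all n draws with probability (1 - 1/n)^n, so the exact expected number of unique samples is n × (1 - (1 - 1/n)^n), and (1 - 1/n)^n approaches 1/e as n grows. A short sketch of that arithmetic; the function name `expected_unique` is hypothetical:

```python
import math

def expected_unique(n):
    """Exact expected count of distinct samples in a bootstrap sample of size n."""
    # Each sample is absent from all n draws with probability (1 - 1/n) ** n.
    return n * (1 - (1 - 1 / n) ** n)

limit = 1 - 1 / math.e  # large-n limit of the unique fraction
for n in (10, 100, 1000):
    # Print the unique fraction next to the limit to show the convergence.
    print(n, expected_unique(n) / n, limit)
```

For small datasets the unique fraction sits slightly above the limit, so the out-of-bag share is slightly below 36.8%.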

Your Task: Write a Python function that generates the specified number of random subsets from the input dataset. Each subset must maintain the correspondence between features and labels (i.e., if sample index i is selected, both X[i] and y[i] are included in the subset). Return the subsets as a list of tuples, where each tuple contains the feature subset and label subset as Python lists.

Reproducibility: Use the provided seed parameter to initialize the random number generator, ensuring consistent results across multiple runs.
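One possible shape for such a function is sketched below, as an illustration and not as the reference solution. The function name `get_random_subsets`, the parameter name `n_subsets`, and the default values are assumptions, as is the choice of `np.random.default_rng` for seeding. The exact indices drawn depend on which NumPy generator is used, so this sketch may not reproduce the specific subsets listed in the examples.

```python
import numpy as np

def get_random_subsets(X, y, n_subsets, replacements=True, seed=42):
    """Sketch: return n_subsets tuples of (X_subset, y_subset) as Python lists."""
    X = np.asarray(X)
    y = np.asarray(y)
    n = X.shape[0]
    # Assumption: a local generator seeded once, so repeated calls match.
    rng = np.random.default_rng(seed)
    size = n if replacements else n // 2
    subsets = []
    for _ in range(n_subsets):
        idx = rng.choice(n, size=size, replace=replacements)
        # Index X and y with the same array to keep rows and labels paired.
        subsets.append((X[idx].tolist(), y[idx].tolist()))
    return subsets

# Illustrative data, not taken from the problem statement.
X_demo = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y_demo = [1, 2, 3, 4, 5]
demo = get_random_subsets(X_demo, y_demo, 3, replacements=False, seed=42)
```

Seeding once before the loop, instead of once per subset, is what lets the subsets differ from each other while the whole call stays reproducible.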

With 5 samples and replacements=False, each subset contains n//2 = 2 samples drawn without replacement.

• Subset 1: Samples at indices [1, 4] → X_subset=[[3,4], [9,10]], y_subset=[2, 5]
• Subset 2: Samples at indices [3, 1] → X_subset=[[7,8], [3,4]], y_subset=[4, 2]
• Subset 3: Samples at indices [1, 0] → X_subset=[[3,4], [1,2]], y_subset=[2, 1]

Each subset is a tuple of (X_subset, y_subset), and no sample is repeated within a single subset.

With 4 samples and replacements=True (bootstrap sampling), each subset contains n = 4 samples drawn with replacement.

• Subset 1: Indices [2, 3, 0, 2] selected → X_subset=[[5,6], [7,8], [1,2], [5,6]], y_subset=[3, 4, 1, 3]
Note: Index 2 appears twice in this bootstrap sample.

• Subset 2: Indices [2, 3, 0, 0] selected → X_subset=[[5,6], [7,8], [1,2], [1,2]], y_subset=[3, 4, 1, 1]
Note: Index 0 appears twice in this bootstrap sample.

This demonstrates how bootstrap sampling allows repeated selection of the same sample.

With 4 samples and replacements=False, the single subset contains 4//2 = 2 samples.

• Subset 1: Indices [1, 3] selected → X_subset=[[0,1], [0,0]], y_subset=[1, 0]

This represents a binary feature dataset (like XOR logic inputs) with a subset selected for training.
