Cross-validation is a fundamental model evaluation technique in machine learning that provides robust estimates of how a model will generalize to unseen data. Unlike a simple train-test split, cross-validation systematically rotates through different portions of the data for training and validation, ensuring every data point is used for validation exactly once and for training in all the remaining iterations.
In k-fold cross-validation, the dataset is divided into k non-overlapping partitions (called folds) of roughly equal size. The algorithm then performs k iterations; in iteration i:
• Fold i is held out as the validation set.
• The remaining k-1 folds are combined to form the training set.
• The model is trained on the training set and evaluated on the held-out fold.
The key insight is that this approach gives us k different performance measurements, which can be averaged to obtain a more reliable estimate of model performance than a single train-test split.
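The averaging step can be sketched in a few lines; the per-fold scores below are hypothetical placeholders, not results from a real model:

```python
# Hypothetical accuracy scores from k = 5 validation folds
fold_scores = [0.82, 0.78, 0.85, 0.80, 0.81]

# The cross-validation estimate is the mean of the per-fold scores,
# which is less sensitive to an unlucky single split
cv_estimate = sum(fold_scores) / len(fold_scores)
print(round(cv_estimate, 3))  # 0.812
```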
The shuffle parameter controls whether the indices are randomly permuted before partitioning. When shuffle is enabled, the indices are passed through np.random.shuffle() before splitting. This helps when data might be ordered by class or time, ensuring each fold contains a representative sample.

Given a dataset with n samples and k folds:
Implement a function that generates the train-validation index splits for k-fold cross-validation. For each of the k iterations, return the lists of indices that form the training set and validation set.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
k = 5
shuffle = False

Output:
[([2, 3, 4, 5, 6, 7, 8, 9], [0, 1]), ([0, 1, 4, 5, 6, 7, 8, 9], [2, 3]), ([0, 1, 2, 3, 6, 7, 8, 9], [4, 5]), ([0, 1, 2, 3, 4, 5, 8, 9], [6, 7]), ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])]

With 10 samples and k=5 folds (no shuffling), each fold contains 2 consecutive indices:
• Fold 0: Indices [0, 1] → validation; Indices [2-9] → training
• Fold 1: Indices [2, 3] → validation; Indices [0, 1, 4-9] → training
• Fold 2: Indices [4, 5] → validation; Indices [0-3, 6-9] → training
• Fold 3: Indices [6, 7] → validation; Indices [0-5, 8, 9] → training
• Fold 4: Indices [8, 9] → validation; Indices [0-7] → training
Each sample appears in exactly one validation fold and (k-1)=4 training folds.
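One way to implement the splitting described above is sketched below. The function name k_fold_indices and its signature are my own choices, not specified by the problem; when shuffle=False, indices are taken consecutively, and earlier folds absorb the remainder when n is not divisible by k. The exact shuffle mechanism (a seeded RandomState) is an assumption about how "seed 42" is applied:

```python
import numpy as np

def k_fold_indices(n, k, shuffle=False, seed=42):
    """Return a list of (train_indices, validation_indices) pairs."""
    indices = np.arange(n)
    if shuffle:
        # Assumed shuffle mechanism: a seeded legacy RandomState
        np.random.RandomState(seed).shuffle(indices)

    # Earlier folds get one extra sample when n is not divisible by k
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]

    splits, start = [], 0
    for size in fold_sizes:
        # Validation fold: a consecutive slice of the (possibly shuffled) indices
        val_idx = indices[start:start + size].tolist()
        # Training set: everything outside that slice
        train_idx = np.concatenate(
            [indices[:start], indices[start + size:]]
        ).tolist()
        splits.append((train_idx, val_idx))
        start += size

    return splits

# Reproduces the first example (10 samples, k = 5, no shuffling)
print(k_fold_indices(10, 5))
```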
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 1, 2, 3, 4, 5, 6, 7]
k = 4
shuffle = True

Output:
[([0, 7, 2, 4, 3, 6], [1, 5]), ([1, 5, 2, 4, 3, 6], [0, 7]), ([1, 5, 0, 7, 3, 6], [2, 4]), ([1, 5, 0, 7, 2, 4], [3, 6])]

With shuffle=True, the indices are first randomly shuffled (using seed 42) before partitioning. The shuffled order becomes [1, 5, 0, 7, 2, 4, 3, 6].
With 8 samples and k=4 folds, each fold contains 2 samples from this shuffled order:
• Fold 0: Shuffled positions [0, 1] → indices [1, 5] for validation
• Fold 1: Shuffled positions [2, 3] → indices [0, 7] for validation
• Fold 2: Shuffled positions [4, 5] → indices [2, 4] for validation
• Fold 3: Shuffled positions [6, 7] → indices [3, 6] for validation
The training set for each fold consists of all indices not in the validation set.
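The position-to-index mapping described above can be checked directly. The shuffled order is taken verbatim from the example rather than regenerated, so no random number generation is involved:

```python
# Shuffled index order from the example (produced with seed 42)
shuffled = [1, 5, 0, 7, 2, 4, 3, 6]
k = 4
fold_size = len(shuffled) // k  # 8 samples, k = 4 -> 2 per fold

splits = []
for i in range(k):
    # Consecutive positions in the shuffled order form the validation fold
    val = shuffled[i * fold_size:(i + 1) * fold_size]
    # All remaining indices (kept in shuffled order) form the training set
    train = [idx for idx in shuffled if idx not in val]
    splits.append((train, val))

print(splits[0])  # ([0, 7, 2, 4, 3, 6], [1, 5])
```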
X = [[0], [1], [2], [3], [4], [5], [6]]
y = [0, 1, 2, 3, 4, 5, 6]
k = 3
shuffle = False

Output:
[([3, 4, 5, 6], [0, 1, 2]), ([0, 1, 2, 5, 6], [3, 4]), ([0, 1, 2, 3, 4], [5, 6])]

With 7 samples and k=3 folds, the samples cannot be divided evenly. The folds are distributed as follows:
• Fold 0: 3 samples (indices 0, 1, 2)
• Fold 1: 2 samples (indices 3, 4)
• Fold 2: 2 samples (indices 5, 6)
This demonstrates handling of non-divisible dataset sizes, where earlier folds may contain one extra sample (⌈7/3⌉ = 3 vs ⌊7/3⌋ = 2).
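The ⌈n/k⌉ vs ⌊n/k⌋ split can be computed with integer arithmetic: the first n mod k folds each receive one extra sample.

```python
n, k = 7, 3
# The first (n % k) folds get one extra sample on top of the base n // k
fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
print(fold_sizes)  # [3, 2, 2]
assert sum(fold_sizes) == n  # every sample lands in exactly one fold
```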
Constraints