Cross-validation is a fundamental model evaluation technique in machine learning that provides robust estimates of how a model will generalize to unseen data. Unlike a simple train-test split, cross-validation systematically rotates through different portions of the data for training and validation, ensuring every data point is used for validation exactly once and for training in all the remaining iterations.
In k-fold cross-validation, the dataset is divided into k non-overlapping partitions (called folds) of roughly equal size. The algorithm then performs k iterations; in iteration i:
• Fold i is held out as the validation set.
• The remaining k-1 folds are combined to form the training set.
• The model is trained on the training set and evaluated on the held-out fold.
The key insight is that this approach gives us k different performance measurements, which can be averaged to obtain a more reliable estimate of model performance than a single train-test split.
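The averaging step can be sketched in a few lines; the per-fold scores below are hypothetical placeholders, not results from a real model:

```python
# Hypothetical accuracy scores from k = 5 validation folds
fold_scores = [0.82, 0.78, 0.85, 0.80, 0.81]

# The cross-validation estimate is the mean of the per-fold scores,
# which is less sensitive to an unlucky single split
cv_estimate = sum(fold_scores) / len(fold_scores)
print(round(cv_estimate, 3))  # 0.812
```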
The shuffle parameter controls whether the indices are randomly permuted before partitioning. When shuffle is enabled, the indices are passed through np.random.shuffle() before splitting. This helps when data might be ordered by class or time, ensuring each fold contains a representative sample.

Given a dataset with n samples and k folds:
Implement a function that generates the train-validation index splits for k-fold cross-validation. For each of the k iterations, return the lists of indices that form the training set and validation set.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
k = 5
shuffle = False

Output:
[([2, 3, 4, 5, 6, 7, 8, 9], [0, 1]), ([0, 1, 4, 5, 6, 7, 8, 9], [2, 3]), ([0, 1, 2, 3, 6, 7, 8, 9], [4, 5]), ([0, 1, 2, 3, 4, 5, 8, 9], [6, 7]), ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])]

With 10 samples and k=5 folds (no shuffling), each fold contains 2 consecutive indices:
• Fold 0: Indices [0, 1] → validation; Indices [2-9] → training
• Fold 1: Indices [2, 3] → validation; Indices [0, 1, 4-9] → training
• Fold 2: Indices [4, 5] → validation; Indices [0-3, 6-9] → training
• Fold 3: Indices [6, 7] → validation; Indices [0-5, 8, 9] → training
• Fold 4: Indices [8, 9] → validation; Indices [0-7] → training
Each sample appears in exactly one validation fold and (k-1)=4 training folds.
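One way to implement the splitting described above is sketched below. The function name k_fold_indices and its signature are my own choices, not specified by the problem; when shuffle=False, indices are taken consecutively, and earlier folds absorb the remainder when n is not divisible by k. The exact shuffle mechanism (a seeded RandomState) is an assumption about how "seed 42" is applied:

```python
import numpy as np

def k_fold_indices(n, k, shuffle=False, seed=42):
    """Return a list of (train_indices, validation_indices) pairs."""
    indices = np.arange(n)
    if shuffle:
        # Assumed shuffle mechanism: a seeded legacy RandomState
        np.random.RandomState(seed).shuffle(indices)

    # Earlier folds get one extra sample when n is not divisible by k
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]

    splits, start = [], 0
    for size in fold_sizes:
        # Validation fold: a consecutive slice of the (possibly shuffled) indices
        val_idx = indices[start:start + size].tolist()
        # Training set: everything outside that slice
        train_idx = np.concatenate(
            [indices[:start], indices[start + size:]]
        ).tolist()
        splits.append((train_idx, val_idx))
        start += size

    return splits

# Reproduces the first example (10 samples, k = 5, no shuffling)
print(k_fold_indices(10, 5))
```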
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 1, 2, 3, 4, 5, 6, 7]
k = 4
shuffle = True

Output:
[([0, 7, 2, 4, 3, 6], [1, 5]), ([1, 5, 2, 4, 3, 6], [0, 7]), ([1, 5, 0, 7, 3, 6], [2, 4]), ([1, 5, 0, 7, 2, 4], [3, 6])]

With shuffle=True, the indices are first randomly shuffled (using seed 42) before partitioning. The shuffled order becomes [1, 5, 0, 7, 2, 4, 3, 6].
With 8 samples and k=4 folds, each fold contains 2 samples from this shuffled order:
• Fold 0: Shuffled positions [0, 1] → indices [1, 5] for validation
• Fold 1: Shuffled positions [2, 3] → indices [0, 7] for validation
• Fold 2: Shuffled positions [4, 5] → indices [2, 4] for validation
• Fold 3: Shuffled positions [6, 7] → indices [3, 6] for validation
The training set for each fold consists of all indices not in the validation set.
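The position-to-index mapping described above can be checked directly. The shuffled order is taken verbatim from the example rather than regenerated, so no random number generation is involved:

```python
# Shuffled index order from the example (produced with seed 42)
shuffled = [1, 5, 0, 7, 2, 4, 3, 6]
k = 4
fold_size = len(shuffled) // k  # 8 samples, k = 4 -> 2 per fold

splits = []
for i in range(k):
    # Consecutive positions in the shuffled order form the validation fold
    val = shuffled[i * fold_size:(i + 1) * fold_size]
    # All remaining indices (kept in shuffled order) form the training set
    train = [idx for idx in shuffled if idx not in val]
    splits.append((train, val))

print(splits[0])  # ([0, 7, 2, 4, 3, 6], [1, 5])
```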
X = [[0], [1], [2], [3], [4], [5], [6]]
y = [0, 1, 2, 3, 4, 5, 6]
k = 3
shuffle = False

Output:
[([3, 4, 5, 6], [0, 1, 2]), ([0, 1, 2, 5, 6], [3, 4]), ([0, 1, 2, 3, 4], [5, 6])]

With 7 samples and k=3 folds, the samples cannot be divided evenly. The folds are distributed as follows:
• Fold 0: 3 samples (indices 0, 1, 2)
• Fold 1: 2 samples (indices 3, 4)
• Fold 2: 2 samples (indices 5, 6)
This demonstrates handling of non-divisible dataset sizes, where earlier folds may contain one extra sample (⌈7/3⌉ = 3 vs ⌊7/3⌋ = 2).
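The ⌈n/k⌉ vs ⌊n/k⌋ split can be computed with integer arithmetic: the first n mod k folds each receive one extra sample.

```python
n, k = 7, 3
# The first (n % k) folds get one extra sample on top of the base n // k
fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
print(fold_sizes)  # [3, 2, 2]
assert sum(fold_sizes) == n  # every sample lands in exactly one fold
```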
Constraints