Gradient-based optimization is the cornerstone of modern machine learning, enabling models to learn from data by iteratively adjusting parameters to minimize a loss function. In this problem, you will implement three fundamental variants of gradient-based parameter optimization, each with distinct characteristics that make them suitable for different scenarios.
Given a dataset with n samples and m features, we seek optimal parameters (weights) w that minimize the Mean Squared Error (MSE) between predictions and actual values:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $\hat{y}_i = X_i \cdot w$ represents the model's prediction for sample $i$.
The gradient of the MSE with respect to the weights indicates the direction of steepest ascent. By moving in the opposite direction (negative gradient), we reduce the loss:
$$\nabla_w MSE = -\frac{2}{n} X^T (y - Xw)$$
The weight update rule becomes:
$$w_{new} = w_{old} - \eta \cdot \nabla_w MSE$$
where $\eta$ is the learning rate controlling the step size.
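The gradient and update formulas above translate directly into code. A minimal sketch in plain Python (the function names `mse_gradient` and `step` are illustrative, not part of any required interface):

```python
def mse_gradient(X, y, w):
    """Gradient of the MSE w.r.t. w: -(2/n) * X^T (y - Xw)."""
    n, m = len(X), len(w)
    # Residuals r_i = y_i - X_i . w
    r = [y[i] - sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
    return [(-2.0 / n) * sum(X[i][j] * r[i] for i in range(n)) for j in range(m)]

def step(w, grad, lr):
    """One update: w_new = w_old - eta * grad."""
    return [wj - lr * gj for wj, gj in zip(w, grad)]
```

For a single sample X = [[1.0]], y = [2.0], w = [0.0], the residual is 2.0, so the gradient is -4.0 and one step with lr = 0.01 moves the weight to 0.04.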
Full-Batch Optimization (method='batch'): Computes the gradient using all samples simultaneously before making a single weight update. This provides the most accurate gradient estimate but requires processing the entire dataset per update.
Per-epoch procedure:
1. Compute predictions for all n samples.
2. Compute the gradient over the entire dataset.
3. Perform a single weight update.
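One full-batch epoch can be sketched in plain Python (the function name is illustrative):

```python
def batch_epoch(X, y, w, lr):
    """One epoch of full-batch descent: a single update from the
    gradient -(2/n) * X^T (y - Xw) computed over all n samples."""
    n, m = len(X), len(w)
    r = [y[i] - sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
    grad = [(-2.0 / n) * sum(X[i][j] * r[i] for i in range(n)) for j in range(m)]
    return [w[j] - lr * grad[j] for j in range(m)]
```

Starting from w = [0, 0] on the dataset used in the examples below, the first epoch's residuals are just y itself, giving one deterministic update regardless of sample order.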
Single-Sample Optimization (method='stochastic'): Updates weights after every individual sample, processing samples in their original order (no shuffling). This creates rapid but noisy updates that can help escape local minima.
Per-epoch procedure:
1. For each sample i, in original order:
   a. Compute the prediction and gradient for sample i alone (the gradient formula with n = 1).
   b. Update the weights immediately.
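A sketch of one single-sample epoch (name illustrative; note the weights change between samples within the same epoch):

```python
def stochastic_epoch(X, y, w, lr):
    """One epoch of per-sample updates, samples taken in their
    original order (no shuffling)."""
    m = len(w)
    w = list(w)
    for xi, yi in zip(X, y):
        r = yi - sum(xi[j] * w[j] for j in range(m))
        # Per-sample gradient: -2 * x_i * (y_i - x_i . w)   (n = 1)
        w = [w[j] + lr * 2.0 * xi[j] * r for j in range(m)]
    return w
```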
Grouped-Sample Optimization (method='mini_batch'): A hybrid approach that updates weights after processing groups of samples (batches). Balances the stability of full-batch with the speed of single-sample methods.
Per-epoch procedure:
1. Split the samples, in order, into consecutive batches of batch_size.
2. For each batch: compute the gradient over that batch alone, then update the weights.
Note: If the final batch has fewer samples than the batch size, use the actual number of remaining samples.
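The mini-batch procedure, including the short-final-batch rule from the note above, can be sketched as (name illustrative):

```python
def mini_batch_epoch(X, y, w, lr, batch_size):
    """One epoch of mini-batch updates over consecutive batches."""
    m = len(w)
    w = list(w)
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        nb = len(Xb)  # actual batch size: smaller for a short final batch
        r = [yb[i] - sum(Xb[i][j] * w[j] for j in range(m)) for i in range(nb)]
        grad = [(-2.0 / nb) * sum(Xb[i][j] * r[i] for i in range(nb)) for j in range(m)]
        w = [w[j] - lr * grad[j] for j in range(m)]
    return w
```

Slicing with `X[start:start + batch_size]` handles the final-batch rule automatically: Python slices simply stop at the end of the list, so `nb` becomes the number of remaining samples.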
The n_epochs parameter specifies how many complete passes through the dataset to perform.

Example 1:
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [2.0, 3.0, 4.0, 5.0]
weights = [0.0, 0.0]
learning_rate = 0.01
n_epochs = 100
method = "batch"

Output: [1.1491, 0.5618]

Explanation: Using full-batch optimization, the algorithm processes all 4 samples together in each epoch. After 100 epochs with a learning rate of 0.01, the weights converge to approximately [1.1491, 0.5618]. These weights represent a linear model y ≈ 1.1491·x₁ + 0.5618·x₂, which closely fits the pattern y = x₁ + 1 in the training data.
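This example can be reproduced end to end with a compact full-batch loop (a self-contained sketch; `fit_batch` is an illustrative name):

```python
def fit_batch(X, y, w, lr, n_epochs):
    """Run n_epochs of full-batch gradient descent."""
    n, m = len(X), len(w)
    w = list(w)
    for _ in range(n_epochs):
        r = [y[i] - sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
        grad = [(-2.0 / n) * sum(X[i][j] * r[i] for i in range(n)) for j in range(m)]
        w = [w[j] - lr * grad[j] for j in range(m)]
    return w

w = fit_batch([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]],
              [2.0, 3.0, 4.0, 5.0], [0.0, 0.0], 0.01, 100)
print([round(v, 4) for v in w])  # close to the stated output [1.1491, 0.5618]
```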
Example 2:
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [2.0, 3.0, 4.0, 5.0]
weights = [0.0, 0.0]
learning_rate = 0.01
n_epochs = 100
method = "stochastic"

Output: [1.0508, 0.8366]

Explanation: Using single-sample optimization, weights are updated after each individual sample. This results in 400 total updates (4 samples × 100 epochs). The final weights [1.0508, 0.8366] differ from batch optimization due to the noisier gradient estimates, but still approximate the linear relationship in the data.
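The stochastic run can likewise be reproduced with a per-sample loop, assuming the per-sample gradient uses the same factor of 2 as the batch formula (`fit_stochastic` is an illustrative name):

```python
def fit_stochastic(X, y, w, lr, n_epochs):
    """Run n_epochs of per-sample updates in original sample order."""
    m = len(w)
    w = list(w)
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            r = yi - sum(xi[j] * w[j] for j in range(m))
            w = [w[j] + lr * 2.0 * xi[j] * r for j in range(m)]
    return w

w = fit_stochastic([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]],
                   [2.0, 3.0, 4.0, 5.0], [0.0, 0.0], 0.01, 100)
```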
Example 3:
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [2.0, 3.0, 4.0, 5.0]
weights = [0.0, 0.0]
batch_size = 2
learning_rate = 0.01
n_epochs = 100
method = "mini_batch"

Output: [1.1033, 0.6833]

Explanation: Using grouped-sample optimization with batch_size=2, the 4 samples are divided into 2 batches: samples 0-1 and samples 2-3. Each epoch performs 2 weight updates. The final weights [1.1033, 0.6833] fall between the batch and stochastic results, offering a balance between gradient accuracy and update frequency.
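The mini-batch run follows the same pattern, averaging the gradient over each consecutive slice of batch_size samples (`fit_mini_batch` is an illustrative name; slicing handles a short final batch automatically):

```python
def fit_mini_batch(X, y, w, lr, n_epochs, batch_size):
    """Run n_epochs of mini-batch gradient descent."""
    m = len(w)
    w = list(w)
    for _ in range(n_epochs):
        for s in range(0, len(X), batch_size):
            Xb, yb = X[s:s + batch_size], y[s:s + batch_size]
            nb = len(Xb)  # actual batch size for the gradient average
            r = [yb[i] - sum(Xb[i][j] * w[j] for j in range(m)) for i in range(nb)]
            grad = [(-2.0 / nb) * sum(Xb[i][j] * r[i] for i in range(nb)) for j in range(m)]
            w = [w[j] - lr * grad[j] for j in range(m)]
    return w

w = fit_mini_batch([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]],
                   [2.0, 3.0, 4.0, 5.0], [0.0, 0.0], 0.01, 100, 2)
```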
Constraints