Gradient-based optimization is the cornerstone of modern machine learning, enabling models to learn from data by iteratively adjusting parameters to minimize a loss function. In this problem, you will implement three fundamental variants of gradient-based parameter optimization, each with distinct characteristics that make them suitable for different scenarios.
Given a dataset with n samples and m features, we seek optimal parameters (weights) w that minimize the Mean Squared Error (MSE) between predictions and actual values:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $\hat{y}_i = X_i \cdot w$ represents the model's prediction for sample $i$.
The gradient of the MSE with respect to the weights indicates the direction of steepest ascent. By moving in the opposite direction (negative gradient), we reduce the loss:
$$\nabla_w MSE = -\frac{2}{n} X^T (y - Xw)$$
The weight update rule becomes:
$$w_{new} = w_{old} - \eta \cdot \nabla_w MSE$$
where $\eta$ is the learning rate controlling the step size.
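The gradient and update formulas above translate directly into code. A minimal sketch in plain Python (the function names `mse_gradient` and `step` are illustrative, not part of any required interface):

```python
def mse_gradient(X, y, w):
    """Gradient of the MSE w.r.t. w: -(2/n) * X^T (y - Xw)."""
    n, m = len(X), len(w)
    # Residuals r_i = y_i - X_i . w
    r = [y[i] - sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
    return [(-2.0 / n) * sum(X[i][j] * r[i] for i in range(n)) for j in range(m)]

def step(w, grad, lr):
    """One update: w_new = w_old - eta * grad."""
    return [wj - lr * gj for wj, gj in zip(w, grad)]
```

For a single sample X = [[1.0]], y = [2.0], w = [0.0], the residual is 2.0, so the gradient is -4.0 and one step with lr = 0.01 moves the weight to 0.04.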
Full-Batch Optimization (method='batch'): Computes the gradient using all samples simultaneously before making a single weight update. This provides the most accurate gradient estimate but requires processing the entire dataset per update.
Per-epoch procedure:
1. Compute predictions for all n samples.
2. Compute the gradient over the entire dataset.
3. Perform a single weight update.
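One full-batch epoch can be sketched in plain Python (the function name is illustrative):

```python
def batch_epoch(X, y, w, lr):
    """One epoch of full-batch descent: a single update from the
    gradient -(2/n) * X^T (y - Xw) computed over all n samples."""
    n, m = len(X), len(w)
    r = [y[i] - sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
    grad = [(-2.0 / n) * sum(X[i][j] * r[i] for i in range(n)) for j in range(m)]
    return [w[j] - lr * grad[j] for j in range(m)]
```

Starting from w = [0, 0] on the dataset used in the examples below, the first epoch's residuals are just y itself, giving one deterministic update regardless of sample order.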
Single-Sample Optimization (method='stochastic'): Updates weights after every individual sample, processing samples in their original order (no shuffling). This creates rapid but noisy updates that can help escape local minima.
Per-epoch procedure:
1. For each sample i, in original order:
   a. Compute the prediction and gradient for sample i alone (the gradient formula with n = 1).
   b. Update the weights immediately.
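A sketch of one single-sample epoch (name illustrative; note the weights change between samples within the same epoch):

```python
def stochastic_epoch(X, y, w, lr):
    """One epoch of per-sample updates, samples taken in their
    original order (no shuffling)."""
    m = len(w)
    w = list(w)
    for xi, yi in zip(X, y):
        r = yi - sum(xi[j] * w[j] for j in range(m))
        # Per-sample gradient: -2 * x_i * (y_i - x_i . w)   (n = 1)
        w = [w[j] + lr * 2.0 * xi[j] * r for j in range(m)]
    return w
```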
Grouped-Sample Optimization (method='mini_batch'): A hybrid approach that updates weights after processing groups of samples (batches). Balances the stability of full-batch with the speed of single-sample methods.
Per-epoch procedure:
1. Split the samples, in order, into consecutive batches of batch_size.
2. For each batch: compute the gradient over that batch alone, then update the weights.
Note: If the final batch has fewer samples than the batch size, use the actual number of remaining samples.
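The mini-batch procedure, including the short-final-batch rule from the note above, can be sketched as (name illustrative):

```python
def mini_batch_epoch(X, y, w, lr, batch_size):
    """One epoch of mini-batch updates over consecutive batches."""
    m = len(w)
    w = list(w)
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        nb = len(Xb)  # actual batch size: smaller for a short final batch
        r = [yb[i] - sum(Xb[i][j] * w[j] for j in range(m)) for i in range(nb)]
        grad = [(-2.0 / nb) * sum(Xb[i][j] * r[i] for i in range(nb)) for j in range(m)]
        w = [w[j] - lr * grad[j] for j in range(m)]
    return w
```

Slicing with `X[start:start + batch_size]` handles the final-batch rule automatically: Python slices simply stop at the end of the list, so `nb` becomes the number of remaining samples.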
The n_epochs parameter specifies how many complete passes through the dataset to perform.

Example 1:
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [2.0, 3.0, 4.0, 5.0]
weights = [0.0, 0.0]
learning_rate = 0.01
n_epochs = 100
method = "batch"

Output: [1.1491, 0.5618]

Explanation: Using full-batch optimization, the algorithm processes all 4 samples together in each epoch. After 100 epochs with a learning rate of 0.01, the weights converge to approximately [1.1491, 0.5618]. These weights represent a linear model y ≈ 1.1491·x₁ + 0.5618·x₂, which closely fits the pattern y = x₁ + 1 in the training data.
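This example can be reproduced end to end with a compact full-batch loop (a self-contained sketch; `fit_batch` is an illustrative name):

```python
def fit_batch(X, y, w, lr, n_epochs):
    """Run n_epochs of full-batch gradient descent."""
    n, m = len(X), len(w)
    w = list(w)
    for _ in range(n_epochs):
        r = [y[i] - sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
        grad = [(-2.0 / n) * sum(X[i][j] * r[i] for i in range(n)) for j in range(m)]
        w = [w[j] - lr * grad[j] for j in range(m)]
    return w

w = fit_batch([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]],
              [2.0, 3.0, 4.0, 5.0], [0.0, 0.0], 0.01, 100)
print([round(v, 4) for v in w])  # close to the stated output [1.1491, 0.5618]
```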
Example 2:
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [2.0, 3.0, 4.0, 5.0]
weights = [0.0, 0.0]
learning_rate = 0.01
n_epochs = 100
method = "stochastic"

Output: [1.0508, 0.8366]

Explanation: Using single-sample optimization, weights are updated after each individual sample. This results in 400 total updates (4 samples × 100 epochs). The final weights [1.0508, 0.8366] differ from batch optimization due to the noisier gradient estimates, but still approximate the linear relationship in the data.
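The stochastic run can likewise be reproduced with a per-sample loop, assuming the per-sample gradient uses the same factor of 2 as the batch formula (`fit_stochastic` is an illustrative name):

```python
def fit_stochastic(X, y, w, lr, n_epochs):
    """Run n_epochs of per-sample updates in original sample order."""
    m = len(w)
    w = list(w)
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            r = yi - sum(xi[j] * w[j] for j in range(m))
            w = [w[j] + lr * 2.0 * xi[j] * r for j in range(m)]
    return w

w = fit_stochastic([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]],
                   [2.0, 3.0, 4.0, 5.0], [0.0, 0.0], 0.01, 100)
```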
Example 3:
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [2.0, 3.0, 4.0, 5.0]
weights = [0.0, 0.0]
batch_size = 2
learning_rate = 0.01
n_epochs = 100
method = "mini_batch"

Output: [1.1033, 0.6833]

Explanation: Using grouped-sample optimization with batch_size=2, the 4 samples are divided into 2 batches: samples 0-1 and samples 2-3. Each epoch performs 2 weight updates. The final weights [1.1033, 0.6833] fall between the batch and stochastic results, offering a balance between gradient accuracy and update frequency.
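The mini-batch run follows the same pattern, averaging the gradient over each consecutive slice of batch_size samples (`fit_mini_batch` is an illustrative name; slicing handles a short final batch automatically):

```python
def fit_mini_batch(X, y, w, lr, n_epochs, batch_size):
    """Run n_epochs of mini-batch gradient descent."""
    m = len(w)
    w = list(w)
    for _ in range(n_epochs):
        for s in range(0, len(X), batch_size):
            Xb, yb = X[s:s + batch_size], y[s:s + batch_size]
            nb = len(Xb)  # actual batch size for the gradient average
            r = [yb[i] - sum(Xb[i][j] * w[j] for j in range(m)) for i in range(nb)]
            grad = [(-2.0 / nb) * sum(Xb[i][j] * r[i] for i in range(nb)) for j in range(m)]
            w = [w[j] - lr * grad[j] for j in range(m)]
    return w

w = fit_mini_batch([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]],
                   [2.0, 3.0, 4.0, 5.0], [0.0, 0.0], 0.01, 100, 2)
```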
Constraints