Gradient descent is one of the most fundamental optimization algorithms in machine learning. It provides an iterative approach to finding the minimum of a function by repeatedly taking steps proportional to the negative of the gradient (or slope) at the current point.
In the context of linear regression, gradient descent is used to find the optimal coefficients (weights) that minimize the difference between predicted values and actual target values. This difference is typically measured using the Mean Squared Error (MSE) cost function.
A linear regression model predicts outputs using the equation:
$$\hat{y} = X \cdot \theta$$
Where:
• $\hat{y}$ is the vector of predicted values,
• $X$ is the feature matrix, whose first column is all ones for the intercept,
• $\theta$ is the vector of coefficients (intercept and feature weights).
The algorithm works by iteratively updating the coefficients in the direction that reduces the cost function:

$$\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$$

Where α (alpha) is the learning rate that controls the step size and m is the number of training examples.
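The update rule above can be sketched directly in NumPy. This is a minimal reference implementation, not the required solution; the function name and the zero initialization of θ are assumptions:

```python
import numpy as np

def linear_regression_gradient_descent(X, y, alpha, iterations):
    """Fit linear regression coefficients by batch gradient descent.

    X is the m x n feature matrix (first column of ones for the
    intercept), y the length-m target vector, alpha the learning
    rate, and iterations the number of update steps.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = X.shape
    theta = np.zeros(n)  # assumed starting point: all-zero coefficients
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m  # gradient of the MSE cost
        theta -= alpha * gradient             # step against the gradient
    return theta
```

Each iteration computes the full-batch gradient of the MSE cost and moves θ a step of size α against it.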
Implement a function that performs linear regression using the gradient descent optimization algorithm. The function should take the feature matrix X, the target vector y, the learning rate alpha, and the number of iterations, and return the fitted coefficient vector θ.
Example 1:

Input:
X = [[1, 1], [1, 2], [1, 3]]
y = [1, 2, 3]
alpha = 0.01
iterations = 1000

Output:
[0.1107, 0.9513]

Explanation: We have 3 data points with a single feature (plus an intercept column), and the targets follow the perfect linear relationship y = x. After 1000 iterations of gradient descent with learning rate 0.01:
• The intercept coefficient converges to approximately 0.1107
• The slope coefficient converges to approximately 0.9513

Note: With more iterations or a different learning rate, these would converge closer to the true values of [0, 1]; gradient descent is still approaching the optimal solution.
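That convergence claim can be checked numerically. The standalone sketch below (assuming zero-initialized coefficients and the batch update rule from above) fits Example 1's data with two different iteration budgets:

```python
import numpy as np

# Example 1's data: y = x exactly, so the optimal coefficients are [0, 1].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

def fit(iterations, alpha=0.01):
    """Batch gradient descent from a zero start, for a given budget."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iterations):
        theta -= alpha / m * X.T @ (X @ theta - y)
    return theta

print(fit(1_000))    # roughly [0.1107, 0.9513], as in the example
print(fit(100_000))  # much closer to the true [0, 1]
```

With 100× the iterations, the remaining error shrinks to numerical noise, confirming that the 1000-iteration answer is simply a partially converged estimate.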
Example 2:

Input:
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [5.0, 8.0, 11.0, 14.0, 17.0]
alpha = 0.01
iterations = 1000

Output:
[1.8, 3.0554]

Explanation: This dataset follows the relationship y = 2 + 3x (intercept of 2, slope of 3). With 5 data points and the same hyperparameters:
• The intercept coefficient reaches approximately 1.8
• The slope coefficient reaches approximately 3.0554

The algorithm is converging toward the true parameters [2.0, 3.0]; additional iterations would bring the coefficients even closer to these optimal values.
Example 3:

Input:
X = [[1, 1, 1], [1, 2, 2], [1, 3, 3], [1, 4, 4]]
y = [3, 6, 9, 12]
alpha = 0.01
iterations = 1000

Output:
[0.0986, 1.4834, 1.4834]

Explanation: This is a multivariate case with 4 data points and 2 features (plus an intercept). The targets follow y = 1.5x₁ + 1.5x₂. After optimization:
• Intercept: 0.0986
• Coefficient for feature 1: 1.4834
• Coefficient for feature 2: 1.4834

Notice that both feature coefficients are equal: the two feature columns have identical values, so the algorithm distributes the weight equally between the perfectly correlated features.
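The equal-weights behavior in Example 3 follows directly from the update rule: identical feature columns produce identical gradient components at every step, so coefficients that start equal stay equal. A self-contained sketch (assuming the batch update rule and zero initialization from above) makes this observable:

```python
import numpy as np

# Example 3's data: the two feature columns are identical copies.
X = np.array([[1, 1, 1], [1, 2, 2], [1, 3, 3], [1, 4, 4]], dtype=float)
y = np.array([3.0, 6.0, 9.0, 12.0])

theta = np.zeros(3)
m = len(y)
for _ in range(1000):
    # Rows of X.T for the two duplicate columns are identical, so the
    # corresponding gradient entries (and hence updates) are identical too.
    theta -= 0.01 / m * X.T @ (X @ theta - y)

print(np.round(theta, 4))
```

Because the duplicated columns make X singular, there is no unique least-squares solution; gradient descent from a symmetric starting point picks the one that splits the weight evenly.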
Constraints