In machine learning, building predictive models that are both accurate and interpretable is a fundamental challenge. One powerful technique to achieve this balance is L1-regularized linear regression, commonly known as the Lasso (Least Absolute Shrinkage and Selection Operator) method.
Unlike standard linear regression, which can produce models with many non-zero coefficients, L1 regularization adds a penalty proportional to the absolute values of the model weights. This penalty serves two crucial purposes:

- Shrinkage: it pulls the coefficient values toward zero, reducing variance and helping to prevent overfitting.
- Feature selection: it can drive some coefficients to exactly zero, effectively removing the corresponding features from the model.
The Optimization Objective
The algorithm seeks to minimize a composite loss function that balances prediction accuracy with model simplicity:
$$J(\mathbf{w}, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \left( \sum_{j=1}^{p} X_{ij} w_j + b \right) \right)^2 + \alpha \sum_{j=1}^{p} |w_j|$$
Where:

- $n$ is the number of training samples and $p$ is the number of features
- $X_{ij}$ is the value of feature $j$ for sample $i$, and $y_i$ is the corresponding target
- $w_j$ are the model weights and $b$ is the bias (intercept)
- $\alpha \ge 0$ controls the strength of the L1 penalty
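This objective can be evaluated directly in code. The sketch below assumes NumPy arrays matching the symbols above; the function name `lasso_objective` is illustrative, not part of the task:

```python
import numpy as np

def lasso_objective(X, y, w, b, alpha):
    """Evaluate J(w, b): halved mean squared error plus the L1 penalty."""
    n = X.shape[0]
    residuals = y - (X @ w + b)              # y_i - (sum_j X_ij w_j + b)
    mse_term = np.sum(residuals ** 2) / (2 * n)
    l1_term = alpha * np.sum(np.abs(w))
    return mse_term + l1_term

# With zero weights and bias, J reduces to the data term alone
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
J = lasso_objective(X, y, np.zeros(1), 0.0, alpha=0.1)
```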
Gradient Descent Optimization
Since the L1 penalty is not differentiable at zero, we use the subgradient method. For each weight update, the subgradient of the L1 term uses the sign function, with $\text{sign}(0)$ taken as $0$:
$$\frac{\partial}{\partial w_j} |w_j| = \text{sign}(w_j)$$
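In NumPy, `np.sign` implements exactly this element-wise subgradient, returning 0 at 0, which is a valid subgradient choice there:

```python
import numpy as np

w = np.array([-1.5, 0.0, 2.3])
sub = np.sign(w)   # element-wise subgradient of |w_j|: [-1., 0., 1.]
```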
The update rules become:

$$w_j \leftarrow w_j - \eta \left( \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right) X_{ij} + \alpha \, \text{sign}(w_j) \right)$$

$$b \leftarrow b - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)$$

where $\hat{y}_i = \sum_{j=1}^{p} X_{ij} w_j + b$ is the model's prediction for sample $i$.
Where η is the learning rate.
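Putting the update rules together, one possible implementation is sketched below. It assumes NumPy inputs and zero initialization of the parameters; the function name and signature are illustrative, not prescribed by the task:

```python
import numpy as np

def l1_regularized_gradient_descent(X, y, alpha=0.1, learning_rate=0.01, max_iter=1000):
    """Train L1-regularized linear regression by subgradient descent on J(w, b)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, p = X.shape
    w = np.zeros(p)   # assumption: parameters start at zero
    b = 0.0
    for _ in range(max_iter):
        y_pred = X @ w + b
        error = y_pred - y
        grad_w = (X.T @ error) / n + alpha * np.sign(w)  # data term + L1 subgradient
        grad_b = error.mean()                            # bias is not penalized
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```

Note that the bias is excluded from the penalty, matching the objective above: only the weights appear inside the $\alpha \sum_j |w_j|$ term, so only `grad_w` carries the sign contribution.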
Your Task
Implement the gradient descent optimization algorithm to train an L1-regularized linear regression model. Your function should iteratively update the weights and bias to minimize the objective function, returning the final optimized parameters.
Example 1:

Input:
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y = [0.0, 1.0, 2.0]
alpha = 0.1
learning_rate = 0.01
max_iter = 1000

Output:
weights = [0.42371644, 0.42371644], bias = 0.15385068

Explanation: With two identical features (both columns have the same values), the L1 regularization distributes the learned weight equally between them. Each feature receives approximately 0.424 of the total weight. The bias term of ~0.154 adjusts the intercept to minimize the overall prediction error. The symmetric weight distribution demonstrates how L1 regularization handles perfectly correlated features.
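The equal split in this example follows from symmetry: identical columns receive identical gradients at every step, so weights that start equal stay equal. A quick check of a single update from zero initialization (variable names are illustrative):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # two identical columns
y = np.array([0.0, 1.0, 2.0])
w, b, alpha = np.zeros(2), 0.0, 0.1

err = X @ w + b - y
grad_w = (X.T @ err) / len(y) + alpha * np.sign(w)
# Both components of grad_w are identical, so the weights remain tied
```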
Example 2:

Input:
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
alpha = 0.01
learning_rate = 0.01
max_iter = 1000

Output:
weights = [1.96953246], bias = 0.10694592

Explanation: This represents a perfect linear relationship where y = 2x. The learned weight of ~1.97 is slightly less than 2.0 due to the L1 penalty pushing the coefficient toward zero. The small regularization strength (α = 0.01) allows the weight to remain close to the true value while providing some shrinkage. The small bias (~0.107) compensates for this shrinkage effect.
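The shrinkage effect can be checked directly by training on the same data with two penalty strengths. The helper below is a compact restatement of the update rules (the name `lasso_gd` is illustrative):

```python
import numpy as np

def lasso_gd(X, y, alpha, lr=0.01, iters=1000):
    """Subgradient descent on the L1-regularized least-squares objective."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= lr * ((X.T @ err) / n + alpha * np.sign(w))
        b -= lr * err.mean()
    return w, b

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
w_small, _ = lasso_gd(X, y, alpha=0.01)
w_large, _ = lasso_gd(X, y, alpha=0.5)
# A stronger penalty shrinks the weight further below the true slope of 2.0
```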
Example 3:

Input:
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]
y = [5.0, 8.0, 11.0, 14.0]
alpha = 0.05
learning_rate = 0.01
max_iter = 500

Output:
weights = [1.26664079, 1.6892517], bias = 0.42261091

Explanation: The true relationship is y = x₁ + 2x₂. After 500 iterations with moderate L1 regularization (α = 0.05), the model learns weights that approximate this pattern. The second feature receives a larger weight (~1.69 vs ~1.27) because it contributes more to the target. The L1 penalty causes both weights to be slightly shrunk from their true values, with the model achieving a balance between fitting the data and keeping weights small.
Constraints