Gradient-based optimization is the backbone of modern machine learning. In this problem, you will implement a complete training pipeline for a binary classification model using the gradient descent algorithm to minimize the binary cross-entropy (log-loss) function.
A binary linear classifier computes predictions using the sigmoid function applied to a linear combination of features:
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
where the linear combination $z$ is computed as:
$$z = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_m x_m = \mathbf{w}^T \mathbf{x}$$
Here, $w_0$ is the bias term (intercept), and $w_1, w_2, \ldots, w_m$ are the feature weights.
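The prediction step above can be sketched in NumPy; the helper names here (sigmoid, predict_proba) are illustrative, not part of the required interface:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    # X: (n, m) feature matrix; w: (m + 1,) weights with w[0] as the bias w_0.
    # Augmenting X with a column of ones lets the bias ride along in the dot product.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)
```

With all weights at zero, every sample gets z = 0 and hence a predicted probability of 0.5.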
The binary cross-entropy loss (also called log-loss) measures how well the predicted probabilities match the true binary labels:
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
This loss function is convex, making it ideal for gradient-based optimization.
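The loss formula translates directly into code; this is a hypothetical helper (the clipping constant eps is an assumption to keep log away from zero, not part of the problem statement):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy, averaged over the n samples.
    # Clip predictions away from exactly 0 and 1 so np.log never sees log(0).
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```

When every prediction is 0.5, each term reduces to log(0.5) regardless of the label, so the loss is -log(0.5) ≈ 0.6931.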
The gradient of the loss with respect to each weight is:
$$\frac{\partial \mathcal{L}}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_{ij}$$
Each weight is updated iteratively:
$$w_j := w_j - \alpha \cdot \frac{\partial \mathcal{L}}{\partial w_j}$$
where $\alpha$ is the learning rate.
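One full gradient-descent step, matching the gradient and update rule above, can be sketched as follows (Xb is assumed to already include the bias column of ones):

```python
import numpy as np

def gradient_step(Xb, y, w, alpha):
    # Xb: (n, m+1) augmented features; y: (n,) labels in {0, 1}; w: (m+1,) weights.
    y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # sigmoid predictions
    grad = Xb.T @ (y_hat - y) / len(y)        # (1/n) * sum_i (y_hat_i - y_i) * x_ij
    return w - alpha * grad                   # w_j := w_j - alpha * dL/dw_j
```

The vectorized expression Xb.T @ (y_hat - y) computes all m + 1 partial derivatives in one matrix product.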
Your task is to implement a function that:
1. Augments the feature matrix X with a column of ones for the bias term.
2. Initializes all coefficients to zero.
3. Runs batch gradient descent for the specified number of iterations, using the update rule above.
4. Returns the final coefficients together with the recorded log-loss values.

Important Considerations:
The returned loss list must contain exactly iterations values (one per iteration, computed before the update).

Example 1:
Input:
X = [[1.0, 0.5], [-0.5, -1.5], [2.0, 1.5], [-2.0, -1.0]]
y = [1, 0, 1, 0]
learning_rate = 0.01
iterations = 5
Output:
([-0.0, 0.0338, 0.0277], [0.6931, 0.6853, 0.6776, 0.67, 0.6625])

Step-by-step breakdown:
Data Preparation: The feature matrix X (4×2) is augmented with a bias column, becoming a 4×3 matrix. All coefficients [w₀, w₁, w₂] are initialized to [0, 0, 0].
Initial State: With all weights at zero, the sigmoid function outputs 0.5 for all samples. The initial log-loss is -log(0.5) ≈ 0.6931.
Gradient Computation: The error (predicted - actual) is computed for each sample, then multiplied by the feature values to get gradients.
Weight Updates: After 5 iterations of gradient descent with α = 0.01, the coefficients reach [-0.0, 0.0338, 0.0277] (rounded to four decimal places).
Loss Trajectory: The loss decreases monotonically from 0.6931 → 0.6625, confirming the model is learning.
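The breakdown above can be reproduced end to end. This is a sketch, not a reference solution: the function name train and the rounding of the returned values are assumptions based on the example output.

```python
import numpy as np

def train(X, y, learning_rate, iterations):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # 1. Augment with a bias column and start from all-zero coefficients.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(iterations):
        y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        # 2. Record the loss BEFORE updating, as the examples do.
        losses.append(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
        # 3. Batch gradient-descent step on all coefficients at once.
        w -= learning_rate * Xb.T @ (y_hat - y) / len(y)
    return np.round(w, 4).tolist(), np.round(losses, 4).tolist()
```

Run on Example 1's input, the first loss is 0.6931 and the loss decreases at every iteration.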
Example 2:
Input:
X = [[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0], [0.5, 1.0], [-0.5, -1.0]]
y = [1, 1, 0, 0, 1, 0]
learning_rate = 0.1
iterations = 10
Output:
([-0.0151, 0.3711, 0.5311], [0.6931, 0.5959, 0.5205, 0.4614, 0.4143, 0.3761, 0.3446, 0.3183, 0.2959, 0.2767])

Larger dataset with higher learning rate:
Dataset: 6 samples with 2 features each. The data shows clear separation—positive samples have positive feature values, negative samples have negative values.
Faster Convergence: With learning_rate = 0.1 (10× higher), the model learns much faster. The loss drops significantly from 0.6931 to 0.2767 in 10 iterations.
Final Weights: The coefficients [-0.0151, 0.3711, 0.5311] show a near-zero bias and positive weights on both features, matching the data's structure of positive feature values for positive labels.
Smooth Convergence: The loss decreases monotonically without oscillation, indicating the learning rate is well-tuned for this dataset.
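The effect of the 10× higher learning rate can be checked directly by running the same training loop on Example 2's data at two rates (a self-contained sketch; the interface is assumed):

```python
import numpy as np

def train(X, y, learning_rate, iterations):
    # Minimal batch gradient-descent loop (same algorithm as in the problem).
    Xb = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(iterations):
        y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        losses.append(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
        w -= learning_rate * Xb.T @ (y_hat - y) / len(y)
    return w, losses

X = [[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0], [0.5, 1.0], [-0.5, -1.0]]
y = [1, 1, 0, 0, 1, 0]
_, fast = train(X, y, 0.1, 10)   # Example 2's learning rate
_, slow = train(X, y, 0.01, 10)  # ten times smaller, for comparison
# The larger rate should reach a much lower loss in the same 10 iterations.
```

Both runs start from the same initial loss of ≈ 0.6931, so any gap after 10 iterations is due to the step size alone.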
Example 3:
Input:
X = [[1.0], [2.0], [-1.0], [-2.0], [0.5], [-0.5]]
y = [1, 1, 0, 0, 1, 0]
learning_rate = 0.05
iterations = 8
Output:
([-0.0, 0.2163], [0.6931, 0.6763, 0.6602, 0.6448, 0.6301, 0.616, 0.6025, 0.5895])

Single-feature classification:
Simpler Model: With only one feature, the augmented matrix is 6×2, and we optimize two coefficients: bias w₀ and weight w₁.
Moderate Learning: With learning_rate = 0.05, the loss decreases steadily from 0.6931 to 0.5895 over 8 iterations.
Final Model: The coefficients [-0.0, 0.2163] define a simple decision boundary: with a zero bias and a positive weight w₁, the model predicts class 1 whenever x₁ > 0.
Learning Dynamics: The smaller learning rate results in more gradual optimization compared to Example 2, trading convergence speed for stability.
Constraints