Gradient-based optimization is the backbone of modern machine learning. In this problem, you will implement a complete training pipeline for a binary classification model using the gradient descent algorithm to minimize the binary cross-entropy (log-loss) function.
A binary linear classifier computes predictions using the sigmoid function applied to a linear combination of features:
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
where the linear combination $z$ is computed as:
$$z = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_m x_m = \mathbf{w}^T \mathbf{x}$$
Here, $w_0$ is the bias term (intercept), and $w_1, w_2, \ldots, w_m$ are the feature weights.
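The prediction step above can be sketched in NumPy; the helper names here (sigmoid, predict_proba) are illustrative, not part of the required interface:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    # X: (n, m) feature matrix; w: (m + 1,) weights with w[0] as the bias w_0.
    # Augmenting X with a column of ones lets the bias ride along in the dot product.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)
```

With all weights at zero, every sample gets z = 0 and hence a predicted probability of 0.5.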
The binary cross-entropy loss (also called log-loss) measures how well the predicted probabilities match the true binary labels:
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
This loss function is convex, making it ideal for gradient-based optimization.
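The loss formula translates directly into code; this is a hypothetical helper (the clipping constant eps is an assumption to keep log away from zero, not part of the problem statement):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy, averaged over the n samples.
    # Clip predictions away from exactly 0 and 1 so np.log never sees log(0).
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```

When every prediction is 0.5, each term reduces to log(0.5) regardless of the label, so the loss is -log(0.5) ≈ 0.6931.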
The gradient of the loss with respect to each weight is:
$$\frac{\partial \mathcal{L}}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_{ij}$$
Each weight is updated iteratively:
$$w_j := w_j - \alpha \cdot \frac{\partial \mathcal{L}}{\partial w_j}$$
where $\alpha$ is the learning rate.
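One full gradient-descent step, matching the gradient and update rule above, can be sketched as follows (Xb is assumed to already include the bias column of ones):

```python
import numpy as np

def gradient_step(Xb, y, w, alpha):
    # Xb: (n, m+1) augmented features; y: (n,) labels in {0, 1}; w: (m+1,) weights.
    y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # sigmoid predictions
    grad = Xb.T @ (y_hat - y) / len(y)        # (1/n) * sum_i (y_hat_i - y_i) * x_ij
    return w - alpha * grad                   # w_j := w_j - alpha * dL/dw_j
```

The vectorized expression Xb.T @ (y_hat - y) computes all m + 1 partial derivatives in one matrix product.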
Your task is to implement a function that:
1. Augments the feature matrix X with a column of ones for the bias term.
2. Initializes all coefficients to zero.
3. Runs batch gradient descent for the specified number of iterations, using the update rule above.
4. Returns the final coefficients together with the recorded log-loss values.

Important Considerations:
The returned loss list must contain exactly iterations values (one per iteration, computed before the update).

Example 1:
Input:
X = [[1.0, 0.5], [-0.5, -1.5], [2.0, 1.5], [-2.0, -1.0]]
y = [1, 0, 1, 0]
learning_rate = 0.01
iterations = 5
Output:
([-0.0, 0.0338, 0.0277], [0.6931, 0.6853, 0.6776, 0.67, 0.6625])

Step-by-step breakdown:
Data Preparation: The feature matrix X (4×2) is augmented with a bias column, becoming a 4×3 matrix. All coefficients [w₀, w₁, w₂] are initialized to [0, 0, 0].
Initial State: With all weights at zero, the sigmoid function outputs 0.5 for all samples. The initial log-loss is -log(0.5) ≈ 0.6931.
Gradient Computation: The error (predicted - actual) is computed for each sample, then multiplied by the feature values to get gradients.
Weight Updates: After 5 iterations of gradient descent with α = 0.01, the coefficients reach [-0.0, 0.0338, 0.0277] (rounded to four decimal places).
Loss Trajectory: The loss decreases monotonically from 0.6931 → 0.6625, confirming the model is learning.
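The breakdown above can be reproduced end to end. This is a sketch, not a reference solution: the function name train and the rounding of the returned values are assumptions based on the example output.

```python
import numpy as np

def train(X, y, learning_rate, iterations):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # 1. Augment with a bias column and start from all-zero coefficients.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(iterations):
        y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        # 2. Record the loss BEFORE updating, as the examples do.
        losses.append(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
        # 3. Batch gradient-descent step on all coefficients at once.
        w -= learning_rate * Xb.T @ (y_hat - y) / len(y)
    return np.round(w, 4).tolist(), np.round(losses, 4).tolist()
```

Run on Example 1's input, the first loss is 0.6931 and the loss decreases at every iteration.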
Example 2:
Input:
X = [[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0], [0.5, 1.0], [-0.5, -1.0]]
y = [1, 1, 0, 0, 1, 0]
learning_rate = 0.1
iterations = 10
Output:
([-0.0151, 0.3711, 0.5311], [0.6931, 0.5959, 0.5205, 0.4614, 0.4143, 0.3761, 0.3446, 0.3183, 0.2959, 0.2767])

Larger dataset with higher learning rate:
Dataset: 6 samples with 2 features each. The data shows clear separation—positive samples have positive feature values, negative samples have negative values.
Faster Convergence: With learning_rate = 0.1 (10× higher), the model learns much faster. The loss drops significantly from 0.6931 to 0.2767 in 10 iterations.
Final Weights: The coefficients [-0.0151, 0.3711, 0.5311] show a near-zero bias and positive weights on both features, matching the data's structure of positive feature values for positive labels.
Smooth Convergence: The loss decreases monotonically without oscillation, indicating the learning rate is well-tuned for this dataset.
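The effect of the 10× higher learning rate can be checked directly by running the same training loop on Example 2's data at two rates (a self-contained sketch; the interface is assumed):

```python
import numpy as np

def train(X, y, learning_rate, iterations):
    # Minimal batch gradient-descent loop (same algorithm as in the problem).
    Xb = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(iterations):
        y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        losses.append(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
        w -= learning_rate * Xb.T @ (y_hat - y) / len(y)
    return w, losses

X = [[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0], [0.5, 1.0], [-0.5, -1.0]]
y = [1, 1, 0, 0, 1, 0]
_, fast = train(X, y, 0.1, 10)   # Example 2's learning rate
_, slow = train(X, y, 0.01, 10)  # ten times smaller, for comparison
# The larger rate should reach a much lower loss in the same 10 iterations.
```

Both runs start from the same initial loss of ≈ 0.6931, so any gap after 10 iterations is due to the step size alone.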
Example 3:
Input:
X = [[1.0], [2.0], [-1.0], [-2.0], [0.5], [-0.5]]
y = [1, 1, 0, 0, 1, 0]
learning_rate = 0.05
iterations = 8
Output:
([-0.0, 0.2163], [0.6931, 0.6763, 0.6602, 0.6448, 0.6301, 0.616, 0.6025, 0.5895])

Single-feature classification:
Simpler Model: With only one feature, the augmented matrix is 6×2, and we optimize two coefficients: bias w₀ and weight w₁.
Moderate Learning: With learning_rate = 0.05, the loss decreases steadily from 0.6931 to 0.5895 over 8 iterations.
Final Model: The coefficients [-0.0, 0.2163] define a simple decision boundary: with a zero bias and a positive weight w₁, the model predicts class 1 whenever x₁ > 0.
Learning Dynamics: The smaller learning rate results in more gradual optimization compared to Example 2, trading convergence speed for stability.
Constraints