Linear classifiers form the foundation of classification theory and practice. Despite their simplicity, they are surprisingly effective in many real-world applications and serve as the building blocks for more complex methods.
A linear classifier makes predictions based on a linear combination of features: $$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x} + b) = \text{sign}\left(\sum_{j=1}^d w_j x_j + b\right)$$
By the end of this page, you will understand the mathematical form and geometric interpretation of linear classifiers, the perceptron algorithm as the simplest linear learner, connections between linear classifiers and hyperplane geometry, and when linear classifiers succeed or fail.
The Linear Model
A linear classifier is characterized by a weight vector $\mathbf{w} \in \mathbb{R}^d$ and a bias (intercept) term $b \in \mathbb{R}$.
The decision function (also called discriminant function) is: $$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
The prediction rule is: $$\hat{y} = \begin{cases} 1 & \text{if } g(\mathbf{x}) \geq 0 \\ 0 & \text{if } g(\mathbf{x}) < 0 \end{cases}$$
Homogeneous Form
We can absorb the bias into the weight vector by augmenting features: $$\tilde{\mathbf{x}} = [1, x_1, x_2, \ldots, x_d]^T, \quad \tilde{\mathbf{w}} = [b, w_1, w_2, \ldots, w_d]^T$$
Then: $g(\mathbf{x}) = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$
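As a quick check, here is a minimal numpy sketch (the numbers are illustrative, not from the text) confirming that the augmented form $\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$ produces exactly the same scores as $\mathbf{w}^T\mathbf{x} + b$:

```python
import numpy as np

# Illustrative values: two points in R^2 and arbitrary parameters.
X = np.array([[1.5, -0.3],
              [0.2,  2.1]])
w = np.array([0.4, -1.2])
b = 0.7

# Augment: prepend a constant-1 feature and absorb b as the first weight.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # rows are x~ = [1, x1, x2]
w_aug = np.concatenate([[b], w])                   # w~ = [b, w1, w2]

assert np.allclose(X @ w + b, X_aug @ w_aug)       # g(x) = w~ . x~ for every row
print(X_aug @ w_aug)                               # decision scores; their sign gives the label
```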
| Property | Value/Description |
|---|---|
| Number of parameters | $d + 1$ (weights + bias) |
| Decision boundary | Hyperplane in $\mathbb{R}^d$ |
| Prediction complexity | $O(d)$ per instance |
| Training complexity | Varies by algorithm |
| VC dimension | $d + 1$ |
When features are standardized (zero mean, unit variance), weight magnitudes directly indicate feature importance: a larger $|w_j|$ means feature $j$ has a stronger influence on the decision. This interpretability is a major advantage of linear classifiers.
The perceptron (Rosenblatt, 1958) is the simplest linear classifier, learned through iterative error correction.
Algorithm (using $y \in \{-1, +1\}$)
1. Initialize $\mathbf{w} = \mathbf{0}$ and $b = 0$.
2. For each training example $(\mathbf{x}_i, y_i)$: if $y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0$ (misclassified), update $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i$ and $b \leftarrow b + \eta\, y_i$.
3. Repeat passes over the data until a full pass produces no updates (or a maximum number of epochs is reached).

The learning rate $\eta$ controls the update magnitude (often $\eta = 1$).
```python
import numpy as np

class Perceptron:
    """Simple perceptron classifier."""

    def __init__(self, learning_rate=1.0, max_epochs=1000):
        self.lr = learning_rate
        self.max_epochs = max_epochs

    def fit(self, X, y):
        """Train perceptron on data."""
        n_samples, n_features = X.shape

        # Convert labels to {-1, +1}
        y_signed = 2 * y - 1

        # Initialize weights
        self.w = np.zeros(n_features)
        self.b = 0.0

        for epoch in range(self.max_epochs):
            errors = 0
            for i in range(n_samples):
                if y_signed[i] * (self.w @ X[i] + self.b) <= 0:
                    # Misclassified - update
                    self.w += self.lr * y_signed[i] * X[i]
                    self.b += self.lr * y_signed[i]
                    errors += 1
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break
        return self

    def predict(self, X):
        """Make predictions."""
        scores = X @ self.w + self.b
        return (scores >= 0).astype(int)


# Example
np.random.seed(42)
X = np.vstack([np.random.randn(50, 2) + [2, 2],
               np.random.randn(50, 2) + [-2, -2]])
y = np.array([1]*50 + [0]*50)

model = Perceptron()
model.fit(X, y)
accuracy = (model.predict(X) == y).mean()
print(f"Training accuracy: {accuracy:.2%}")
```

If the data is linearly separable with margin $\gamma$ and all points have norm $\leq R$, the perceptron converges after at most $(R/\gamma)^2$ updates. However, if the data is not linearly separable, the perceptron never converges; it oscillates indefinitely.
Definition
A dataset is linearly separable if there exists a hyperplane that perfectly separates the classes: $$\exists \mathbf{w}, b : y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0 \quad \forall i$$
Testing for Linear Separability
Linear separability can be checked via linear programming: find $\mathbf{w}, b$ such that $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ for all $i$. If feasible, data is separable.
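A minimal sketch of that feasibility check, assuming scipy is available (the helper name is ours, not from the text): a linear program with a zero objective is feasible exactly when a separating $(\mathbf{w}, b)$ exists.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Return True iff some (w, b) satisfies y_i (w.x_i + b) >= 1 for all i.

    X: (n, d) feature matrix; y: labels in {0, 1}.
    """
    y = np.asarray(y)
    n, d = X.shape
    y_signed = 2 * y - 1                                    # map {0, 1} -> {-1, +1}
    # Variables z = [w_1, ..., w_d, b]; the constraint y_i (w.x_i + b) >= 1
    # is rewritten for linprog as  -y_i * [x_i, 1] . z <= -1.
    A_ub = -y_signed[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success                                      # feasible <=> separable
```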
| Dimension | Probability of Separability | Implication |
|---|---|---|
| $d < n$ | Typically low | More samples than features → likely overlap |
| $d \approx n$ | Moderate | May or may not be separable |
| $d \gg n$ | Very high | High-dim space easily separates finite points |
The XOR Problem
The classic example of non-linear-separability: points at $(0,0), (1,1)$ labeled negative and $(0,1), (1,0)$ labeled positive. No single line can separate them—demonstrating a fundamental limitation of linear classifiers.
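To see this concretely, the Perceptron class defined in the code above never converges on these four points, and no linear classifier can get more than 3 of the 4 right (this snippet assumes that class is in scope):

```python
import numpy as np

# The four XOR points with labels as described above.
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_xor = np.array([0, 0, 1, 1])

model = Perceptron(max_epochs=100)            # never prints "Converged ..."
model.fit(X_xor, y_xor)
acc = (model.predict(X_xor) == y_xor).mean()
print(f"XOR training accuracy: {acc:.2%}")    # at most 75%, never 100%
```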
Solutions for Non-Separable Data
Common remedies are to minimize a surrogate loss that tolerates some misclassifications (as logistic regression and soft-margin SVMs do) or to transform the features so the classes become separable in the new space (adding the product feature $x_1 x_2$ makes XOR separable, for example).
The Weight Vector as Normal
The weight vector $\mathbf{w}$ is perpendicular (normal) to the decision boundary. This means that $\mathbf{w}$ points toward the region predicted as class 1, that moving parallel to the boundary leaves $g(\mathbf{x})$ unchanged, and that rescaling $\mathbf{w}$ and $b$ by the same positive constant leaves the boundary and all predictions unchanged.
Distance to the Boundary
For any point $\mathbf{x}$, the signed distance to the decision boundary is: $$\text{distance} = \frac{g(\mathbf{x})}{\|\mathbf{w}\|} = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$$
Positive for class 1 side, negative for class 0 side.
In 2D, the boundary is the line $w_1 x_1 + w_2 x_2 + b = 0$. Solving for $x_2$ (assuming $w_2 \neq 0$), the line has slope $-w_1/w_2$ and intercept $-b/w_2$. Points on either side of the line receive different predictions, with $\mathbf{w}$ pointing toward the class 1 side.
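A small sketch with made-up numbers tying these formulas together:

```python
import numpy as np

# Hypothetical 2D classifier: boundary 2*x1 + 1*x2 - 4 = 0.
w = np.array([2.0, 1.0])
b = -4.0

slope = -w[0] / w[1]                               # -2.0
intercept = -b / w[1]                              #  4.0

x = np.array([3.0, 1.0])                           # a test point
signed_distance = (w @ x + b) / np.linalg.norm(w)  # ~ +1.34, so x lies on the class 1 side
print(slope, intercept, signed_distance)
```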
Several important algorithms produce linear classifiers, differing in their training objectives and resulting properties.
| Algorithm | Loss Function | Output | Key Property |
|---|---|---|---|
| Perceptron | Hinge (margin 0) | Hard labels | Online, simple updates |
| Logistic Regression | Log loss | Probabilities | Calibrated, convex |
| Linear SVM | Hinge (margin 1) | Hard labels | Maximum margin |
| LDA | Generative/Gaussian | Probabilities | Optimal for Gaussian classes |
| Linear Regression + Threshold | Squared loss | Hard labels | Simple but suboptimal |
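A quick comparison sketch, assuming scikit-learn is available (not required by the text): each estimator below learns a boundary of the same linear form $\mathbf{w}^T\mathbf{x} + b = 0$, even though their training objectives differ as summarized in the table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron as SkPerceptron   # renamed to avoid clashing with the class above
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
X = np.vstack([rng.normal( 2.0, 1.0, (50, 2)),     # class 1 cluster
               rng.normal(-2.0, 1.0, (50, 2))])    # class 0 cluster
y = np.array([1] * 50 + [0] * 50)

for clf in (SkPerceptron(), LogisticRegression(), LinearSVC()):
    clf.fit(X, y)
    # Every fitted model exposes the same linear parameters (w, b).
    print(f"{type(clf).__name__:>18}: w = {clf.coef_.ravel()}, b = {clf.intercept_}")
```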
Which to Choose?
As the table suggests: logistic regression is the usual default when calibrated probabilities are needed, a linear SVM when a maximum-margin boundary matters, LDA when the Gaussian class-conditional assumption is plausible, and the perceptron mainly for online settings or as a conceptual baseline; thresholded linear regression is rarely the best choice.
Despite their simplicity, linear classifiers excel in several important settings:
- High-dimensional data such as text or genomics, where classes are often (nearly) linearly separable
- Problems where the interpretability of the learned weights matters
- Large-scale applications, thanks to $O(d)$ prediction cost and simple training
- As fast baselines and as building blocks for more complex methods
In high dimensions, randomly generated point clouds become linearly separable with high probability. This 'blessing of dimensionality' explains why linear classifiers work well for text and genomics data with thousands of features.
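A rough illustration of this effect, reusing the is_linearly_separable helper sketched earlier on this page (random labels, so exact outcomes vary): with 50 points, random labelings are almost never separable in 5 dimensions but almost always separable in 100.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                             # number of points
for d in (5, 25, 100):                             # increasing dimension
    X = rng.standard_normal((n, d))
    y = rng.integers(0, 2, size=n)                 # random labels in {0, 1}
    print(d, is_linearly_separable(X, y))          # True becomes likely once d is comparable to n
```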
You now understand linear classifiers thoroughly. This prepares you for the final piece: loss functions. Understanding how different losses shape classifier behavior completes our foundation for logistic regression.