In the preceding chapters, we mastered the art of predicting continuous values—house prices, stock returns, and temperature forecasts. But a vast universe of real-world problems demands something fundamentally different: predicting categories.
Will this email be spam or legitimate? Will this patient develop diabetes? Will this customer churn? Will this transaction be fraudulent? These questions share a common structure: they require a yes or no answer, a prediction from exactly two possible outcomes.
This is the domain of binary classification—arguably the most important and widely-deployed category of machine learning problems in production systems worldwide. Understanding its formulation is essential before we can build effective classifiers.
By the end of this page, you will have a rigorous understanding of the binary classification problem: its formal mathematical definition, standard notation conventions, dataset structure, the transition from regression thinking to classification thinking, and the rich landscape of real-world applications that drive this field.
Classification represents a fundamental shift from regression in supervised learning. While regression maps inputs to a continuous output space, classification maps inputs to a discrete set of categories called classes or labels.
Definition: Classification Problem
Given a training dataset $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ where:
- $\mathbf{x}_i \in \mathbb{R}^d$ is a feature vector describing the $i$-th example, and
- $y_i \in \mathcal{Y}$ is its class label drawn from a finite set of classes $\mathcal{Y}$,

the goal is to learn a classifier $f: \mathbb{R}^d \rightarrow \mathcal{Y}$ that generalizes well to unseen data.
For binary classification specifically, $\mathcal{Y} = \{0, 1\}$ (or equivalently $\{-1, +1\}$ in some formulations). This two-class structure is not merely a special case—it has profound implications for algorithm design, theoretical analysis, and practical implementation.
Binary classification enjoys unique theoretical properties. Any multi-class problem can be decomposed into binary subproblems (one-vs-all, one-vs-one). Many elegant algorithms (logistic regression, SVM, perceptron) were first developed for the binary case. The binary setting also enables rich probabilistic interpretations where P(y=1|x) completely determines the prediction.
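To make the one-vs-all decomposition concrete, here is a minimal sketch (the helper name `one_vs_rest_labels` is ours, not a library function) that turns a multi-class label vector into one binary problem per class:

```python
import numpy as np
from typing import List

def one_vs_rest_labels(y: np.ndarray, classes: List[int]) -> List[np.ndarray]:
    """Relabel a multi-class vector into one binary (class c vs. rest) problem per class."""
    return [(y == c).astype(int) for c in classes]

# Example: a 3-class problem becomes three binary "is it class c?" problems
y = np.array([0, 2, 1, 2, 0, 1])
for c, y_bin in zip([0, 1, 2], one_vs_rest_labels(y, [0, 1, 2])):
    print(f"class {c} vs. rest: {y_bin}")
```

Training one binary classifier per relabeled vector and predicting the class with the highest score is the standard one-vs-all strategy.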
| Aspect | Regression | Binary Classification |
|---|---|---|
| Output space | $y \in \mathbb{R}$ (continuous) | $y \in \{0, 1\}$ (discrete) |
| Prediction goal | Exact numerical value | Category membership |
| Error measurement | Distance from true value (MSE, MAE) | Misclassification rate, likelihood |
| Geometric interpretation | Fitting a surface | Finding a separating boundary |
| Natural loss function | Squared loss, absolute loss | 0-1 loss, log loss, hinge loss |
| Baseline model | Predict the mean | Predict the majority class |
Let us establish the rigorous mathematical framework for binary classification. This precision is not pedantry—it reveals the structure that enables principled algorithm development.
The Data Generating Process
We assume data points $(\mathbf{x}, y)$ are drawn i.i.d. from an unknown joint distribution $P(\mathbf{X}, Y)$ over $\mathbb{R}^d \times \{0, 1\}$. This distribution can be factored as:
$$P(\mathbf{X}, Y) = P(Y|\mathbf{X}) \cdot P(\mathbf{X}) = P(\mathbf{X}|Y) \cdot P(Y)$$
Both factorizations are valid and lead to fundamentally different approaches:
- **Discriminative** methods model the conditional $P(Y|\mathbf{X})$ directly (logistic regression is the canonical example).
- **Generative** methods model the class-conditional $P(\mathbf{X}|Y)$ and the prior $P(Y)$, then apply Bayes' rule to recover $P(Y|\mathbf{X})$.
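As a quick consistency check on the two factorizations, the toy sketch below (with made-up probabilities over a three-valued feature) computes the posterior both ways and confirms they agree:

```python
import numpy as np

# Toy joint distribution P(X, Y): rows index x in {0, 1, 2}, columns index y in {0, 1}.
# The entries are illustrative and sum to 1.
joint = np.array([
    [0.30, 0.05],   # P(X=0, Y=0), P(X=0, Y=1)
    [0.15, 0.15],   # P(X=1, Y=0), P(X=1, Y=1)
    [0.05, 0.30],   # P(X=2, Y=0), P(X=2, Y=1)
])

# Discriminative route: P(Y|X) = P(X, Y) / P(X)
p_x = joint.sum(axis=1, keepdims=True)                 # marginal P(X)
posterior_direct = joint / p_x                          # rows are P(Y|X=x)

# Generative route: P(Y|X) ∝ P(X|Y) * P(Y)
p_y = joint.sum(axis=0, keepdims=True)                 # prior P(Y)
likelihood = joint / p_y                                # class-conditional P(X|Y)
unnormalized = likelihood * p_y                         # equals P(X, Y)
posterior_bayes = unnormalized / unnormalized.sum(axis=1, keepdims=True)

assert np.allclose(posterior_direct, posterior_bayes)   # both routes agree
print(posterior_direct)
```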
The Classification Function
A classifier is a measurable function $f: \mathbb{R}^d \rightarrow \{0, 1\}$ that assigns a class label to each point in feature space. The goal is to find $f$ that minimizes the expected risk (expected loss):
$$R(f) = \mathbb{E}_{(\mathbf{X}, Y) \sim P}[L(Y, f(\mathbf{X}))]$$
where $L: \{0,1\} \times \{0,1\} \rightarrow \mathbb{R}$ is a loss function quantifying the cost of prediction errors.
The assumption that training samples are independent and identically distributed is foundational. It implies each example provides independent information about P(X,Y), enables concentration inequalities for generalization bounds, and justifies splitting data into train/validation/test sets. Violations (temporal dependence, distribution shift) require specialized techniques.
The Bayes Optimal Classifier
The Bayes optimal classifier $f^*$ minimizes the true risk over all possible classifiers:
$$f^*(\mathbf{x}) = \arg\max_{y \in \{0,1\}} P(Y = y | \mathbf{X} = \mathbf{x})$$
For binary classification with 0-1 loss, this simplifies to:
$$f^*(\mathbf{x}) = \mathbb{1}[P(Y = 1 | \mathbf{X} = \mathbf{x}) > 0.5]$$
where $\mathbb{1}[\cdot]$ is the indicator function. This tells us the optimal strategy is remarkably simple: predict class 1 if and only if it's more probable than class 0 given the features.
The Bayes error rate is the irreducible error—the minimum achievable error even with access to the true distribution:
$$R(f^*) = \mathbb{E}_{\mathbf{X}}[\min(P(Y=0|\mathbf{X}), P(Y=1|\mathbf{X}))]$$
This represents the fundamental noise in the problem: regions where class overlap prevents perfect classification.
```python
import numpy as np
from typing import Callable


def bayes_optimal_classifier(
    x: np.ndarray,
    posterior: Callable[[np.ndarray], float]
) -> int:
    """
    Bayes optimal classifier for binary classification.

    Args:
        x: Feature vector of shape (d,)
        posterior: Function computing P(Y=1|X=x)

    Returns:
        Predicted class label (0 or 1)
    """
    prob_class_1 = posterior(x)
    return 1 if prob_class_1 > 0.5 else 0


def bayes_error_rate(
    X: np.ndarray,
    posterior: Callable[[np.ndarray], float]
) -> float:
    """
    Estimate the Bayes error rate via Monte Carlo.

    Args:
        X: Feature vectors sampled from the marginal P(X), shape (n, d)
        posterior: Function computing P(Y=1|X=x)

    Returns:
        Estimated Bayes error rate
    """
    errors = []
    for x in X:
        p1 = posterior(x)
        # Error is the probability of the less likely class
        errors.append(min(p1, 1 - p1))
    return np.mean(errors)


# Example: Simple 1D problem with overlapping Gaussians
def create_gaussian_posterior(mu0: float, mu1: float, sigma: float,
                              pi1: float = 0.5):
    """
    Create posterior for 1D Gaussian class-conditional distributions.

    Classes have equal variance sigma^2, means mu0 and mu1.
    Prior probability of class 1 is pi1.
    """
    def posterior(x: np.ndarray) -> float:
        # Log-likelihood ratio of class 1 vs. class 0, computed in log space
        log_lr = ((x - mu0)**2 - (x - mu1)**2) / (2 * sigma**2)
        # Apply Bayes' rule with the prior
        log_odds = log_lr + np.log(pi1 / (1 - pi1))
        return 1 / (1 + np.exp(-log_odds))
    return posterior


# Demonstration
if __name__ == "__main__":
    # Two Gaussians with significant overlap
    posterior = create_gaussian_posterior(mu0=-1, mu1=1, sigma=1.5)

    # Draw test points from the marginal P(X): an equal mixture of the two Gaussians,
    # so the average of min(p, 1-p) is a genuine Monte Carlo estimate of the Bayes error
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=10_000)
    X_test = (np.where(labels == 1, 1.0, -1.0)
              + 1.5 * rng.standard_normal(10_000)).reshape(-1, 1)

    # Compute Bayes error
    bayes_err = bayes_error_rate(X_test, lambda x: posterior(x[0]))
    print(f"Estimated Bayes error rate: {bayes_err:.4f}")
    print("This is irreducible - no classifier can do better!")
```

Consistent notation is crucial for understanding classification literature. Different communities (statistics, machine learning, computer science) sometimes use conflicting conventions. We establish the terminology used throughout this course.
Label Encoding Conventions
Two primary encodings exist for binary labels, each with distinct advantages:
| Encoding | Values | Common Use Cases | Mathematical Properties |
|---|---|---|---|
| 0/1 encoding | $y \in \{0, 1\}$ | Logistic regression, neural networks, probability interpretation | $\mathbb{E}[Y|X] = P(Y=1|X)$, natural for Bernoulli likelihood |
| -1/+1 encoding | $y \in \{-1, +1\}$ | SVM, perceptron, margin-based methods | Symmetric around 0, $y \cdot f(x) > 0$ means correct classification |
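Switching between the two encodings is a one-line transformation. The sketch below (helper names are ours) shows the conversion and the margin-based correctness check $y \cdot f(x) > 0$ from the table:

```python
import numpy as np

def to_pm1(y01: np.ndarray) -> np.ndarray:
    """Convert 0/1 labels to -1/+1 labels: y -> 2y - 1."""
    return 2 * y01 - 1

def to_01(ypm: np.ndarray) -> np.ndarray:
    """Convert -1/+1 labels back to 0/1 labels."""
    return (ypm + 1) // 2

y01 = np.array([0, 1, 1, 0])
ypm = to_pm1(y01)                              # [-1, 1, 1, -1]
scores = np.array([-0.3, 2.1, -0.5, -1.2])     # some real-valued classifier scores f(x)
correct = ypm * scores > 0                     # margin check: y * f(x) > 0 means correct
print(ypm, to_01(ypm), correct)
```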
Class Terminology
In binary classification, classes often have asymmetric roles:
| Term | Alternative Names | Description | Example |
|---|---|---|---|
| Positive class | Target, event, class 1 | The class of primary interest | Fraud, disease, spam |
| Negative class | Non-target, non-event, class 0 | The default or baseline class | Legitimate, healthy, not spam |
This asymmetry is not arbitrary—it reflects how problems are framed in practice. We typically want to detect the positive class. The naming affects which errors we measure and prioritize.
Prediction Types
Classifiers can produce different types of outputs: hard labels $\hat{y} \in \{0, 1\}$, real-valued scores whose sign or magnitude indicates the predicted class, or calibrated probabilities $P(Y=1|\mathbf{x})$.
Always prefer models that output probabilities when possible. With P(Y=1|X), you can recover hard predictions at any threshold, rank instances by confidence, make cost-sensitive decisions, combine predictions with other information, and calibrate for specific operating points. Hard labels lose this flexibility.
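To make the cost-sensitive point concrete, here is a minimal sketch of the standard decision-theoretic threshold: with false-positive cost $c_{FP}$ and false-negative cost $c_{FN}$, predicting 1 is cheaper in expectation whenever $P(Y=1|\mathbf{x}) > c_{FP}/(c_{FP} + c_{FN})$. The helper names and cost values below are illustrative:

```python
import numpy as np

def cost_sensitive_threshold(cost_fp: float, cost_fn: float) -> float:
    """Optimal threshold on P(Y=1|x) under asymmetric error costs.

    Predicting 1 costs cost_fp * (1 - p) in expectation; predicting 0 costs cost_fn * p.
    Predicting 1 is cheaper whenever p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def predict_with_costs(probs: np.ndarray, cost_fp: float, cost_fn: float) -> np.ndarray:
    """Turn probabilities into hard labels using the cost-sensitive threshold."""
    return (probs > cost_sensitive_threshold(cost_fp, cost_fn)).astype(int)

# Example: missing a positive case (false negative) is 10x as costly as a false alarm
probs = np.array([0.05, 0.12, 0.30, 0.65, 0.91])
print(cost_sensitive_threshold(cost_fp=1.0, cost_fn=10.0))   # ~0.091
print(predict_with_costs(probs, cost_fp=1.0, cost_fn=10.0))  # [0 1 1 1 1]
```

With the false-negative cost ten times the false-positive cost, the threshold drops to roughly 0.09, so far more instances are flagged positive than the default 0.5 threshold would flag. None of this is possible if the model only emits hard labels.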
Feature and Data Notation
We adopt matrix notation for efficiency:
$$\mathbf{X} = \begin{pmatrix} \rule[.5ex]{2.5ex}{0.4pt} & \mathbf{x}_1^T & \rule[.5ex]{2.5ex}{0.4pt} \\ \rule[.5ex]{2.5ex}{0.4pt} & \mathbf{x}_2^T & \rule[.5ex]{2.5ex}{0.4pt} \\ & \vdots & \\ \rule[.5ex]{2.5ex}{0.4pt} & \mathbf{x}_n^T & \rule[.5ex]{2.5ex}{0.4pt} \end{pmatrix} \in \mathbb{R}^{n \times d}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \{0,1\}^n$$
where:
- $\mathbf{X}$ is the design matrix whose $i$-th row is the feature vector $\mathbf{x}_i^T$,
- $\mathbf{y}$ is the vector of binary labels, and
- $n$ is the number of training examples and $d$ the number of features.
We often augment the design matrix with a column of ones for the bias term: $$\tilde{\mathbf{X}} = [\mathbf{1} | \mathbf{X}] \in \mathbb{R}^{n \times (d+1)}$$
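A minimal sketch of this augmentation (the helper name `add_bias_column` is ours), which lets the bias be folded into the weight vector:

```python
import numpy as np

def add_bias_column(X: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so the bias term can be absorbed into the weights."""
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), X])

X = np.array([[2.0, 3.0],
              [4.0, 1.0]])
print(add_bias_column(X))
# [[1. 2. 3.]
#  [1. 4. 1.]]
```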
Understanding the structure of classification datasets reveals important considerations for algorithm design and evaluation.
Class Balance and Imbalance
The class distribution describes the proportion of examples in each class:
$$\pi_1 = P(Y = 1) \approx \hat{\pi}_1 = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[y_i = 1], \quad \pi_0 = 1 - \pi_1$$
Datasets are classified by their balance:
| Category | Positive Class Share | Examples | Challenges |
|---|---|---|---|
| Balanced | 40-60% | Sentiment analysis, image classification | Standard algorithms work well |
| Moderate imbalance | 10-40% | Medical diagnosis, customer churn | May need class weights or resampling |
| High imbalance | 1-10% | Fraud detection, rare disease | Accuracy becomes misleading; need specialized metrics |
| Extreme imbalance | <1% | Credit card fraud, network intrusion | Standard methods fail; need anomaly detection approaches |
With 99% negative examples, a classifier predicting 'negative' always achieves 99% accuracy—while detecting zero positive cases! Class imbalance makes accuracy deeply misleading. We'll explore appropriate metrics (precision, recall, F1, AUC) and mitigation strategies later in this course.
Feature Types in Classification
Classification datasets mix several feature types: numerical measurements, binary indicators, categorical codes, ordinal ratings, and raw text or images. Each requires appropriate preprocessing (scaling, encoding, or embedding) before it can feed a classifier.
The Separability Spectrum
Datasets differ in how distinctly classes separate in feature space, ranging from linearly separable, to separable only by a nonlinear boundary, to heavily overlapping classes that no boundary can split cleanly.
The degree of overlap directly determines the Bayes error rate. Highly overlapping classes yield high irreducible error, setting a ceiling on any classifier's performance.
```python
import numpy as np
from collections import Counter
from typing import Tuple, Dict, Any


def analyze_classification_dataset(
    X: np.ndarray,
    y: np.ndarray
) -> Dict[str, Any]:
    """
    Comprehensive analysis of a binary classification dataset.

    Args:
        X: Feature matrix of shape (n_samples, n_features)
        y: Label vector of shape (n_samples,)

    Returns:
        Dictionary containing dataset statistics
    """
    n_samples, n_features = X.shape

    # Class distribution
    class_counts = Counter(y)
    n_positive = class_counts.get(1, 0)
    n_negative = class_counts.get(0, 0)
    imbalance_ratio = n_positive / n_samples if n_samples > 0 else 0

    # Categorize imbalance
    if 0.4 <= imbalance_ratio <= 0.6:
        imbalance_category = "Balanced"
    elif 0.1 <= imbalance_ratio <= 0.9:
        imbalance_category = "Moderate imbalance"
    elif 0.01 <= imbalance_ratio <= 0.99:
        imbalance_category = "High imbalance"
    else:
        imbalance_category = "Extreme imbalance"

    # Feature statistics
    feature_means = X.mean(axis=0)
    feature_stds = X.std(axis=0)
    feature_mins = X.min(axis=0)
    feature_maxs = X.max(axis=0)

    # Per-class feature statistics
    X_pos = X[y == 1]
    X_neg = X[y == 0]
    means_pos = X_pos.mean(axis=0) if len(X_pos) > 0 else np.zeros(n_features)
    means_neg = X_neg.mean(axis=0) if len(X_neg) > 0 else np.zeros(n_features)

    # Feature discriminability (difference in means / pooled std)
    pooled_std = np.sqrt((X_pos.var(axis=0) * len(X_pos) +
                          X_neg.var(axis=0) * len(X_neg)) / n_samples)
    pooled_std = np.where(pooled_std == 0, 1, pooled_std)  # Avoid division by zero
    discriminability = np.abs(means_pos - means_neg) / pooled_std

    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "n_positive": n_positive,
        "n_negative": n_negative,
        "positive_ratio": imbalance_ratio,
        "imbalance_category": imbalance_category,
        "feature_means": feature_means,
        "feature_stds": feature_stds,
        "feature_ranges": list(zip(feature_mins, feature_maxs)),
        "class_means_positive": means_pos,
        "class_means_negative": means_neg,
        "feature_discriminability": discriminability,
        "most_discriminative_feature": int(np.argmax(discriminability)),
    }


def report_dataset_summary(analysis: Dict[str, Any]) -> str:
    """Generate human-readable dataset summary."""
    lines = [
        "=" * 60,
        "BINARY CLASSIFICATION DATASET SUMMARY",
        "=" * 60,
        f"Samples: {analysis['n_samples']:,}",
        f"Features: {analysis['n_features']}",
        "",
        "CLASS DISTRIBUTION:",
        f"  Positive (y=1): {analysis['n_positive']:,} ({analysis['positive_ratio']:.1%})",
        f"  Negative (y=0): {analysis['n_negative']:,} ({1 - analysis['positive_ratio']:.1%})",
        f"  Category: {analysis['imbalance_category']}",
        "",
        f"MOST DISCRIMINATIVE FEATURE: #{analysis['most_discriminative_feature']}",
        f"  (discriminability score: "
        f"{analysis['feature_discriminability'][analysis['most_discriminative_feature']]:.3f})",
        "=" * 60,
    ]
    return "\n".join(lines)


# Example usage
if __name__ == "__main__":
    # Generate synthetic dataset
    np.random.seed(42)
    n_pos, n_neg = 100, 900  # Imbalanced

    # Positive class: centered at (2, 2)
    X_pos = np.random.randn(n_pos, 2) + [2, 2]
    y_pos = np.ones(n_pos)

    # Negative class: centered at (0, 0)
    X_neg = np.random.randn(n_neg, 2)
    y_neg = np.zeros(n_neg)

    X = np.vstack([X_pos, X_neg])
    y = np.hstack([y_pos, y_neg])

    analysis = analyze_classification_dataset(X, y)
    print(report_dataset_summary(analysis))
```

Binary classification underlies countless systems operating at scale today. Understanding these applications illuminates both the importance of the problem and the diverse contexts where classification techniques apply.
Healthcare and Medicine
Medical diagnosis is a natural classification domain. Given patient features (symptoms, test results, demographics), predict the presence or absence of a condition.
| Application | Positive Class | Features | Scale |
|---|---|---|---|
| Cancer detection | Malignant tumor | Imaging, biomarkers, genetics | Millions of screenings/year |
| Diabetes prediction | Diabetic patient | Blood glucose, BMI, age, family history | Hundreds of millions affected worldwide |
| Hospital readmission | 30-day readmission | Diagnosis codes, prior history, vitals | 35M+ US hospitalizations/year |
| Disease outbreak detection | Outbreak active | Symptom reports, geographic data | Real-time surveillance |
Finance and Risk
Financial systems rely heavily on binary classification for risk assessment and fraud prevention: credit scoring (will this borrower default?), transaction fraud detection, loan and credit card approval, and anti-money-laundering alerts are all framed as binary predictions, typically with severe class imbalance and asymmetric error costs.
Technology and Internet
Web-scale systems process billions of classification decisions daily: spam filtering, content moderation, ad click prediction, bot and account-takeover detection, and malware classification all run continuously in production.
In spam filtering, a false positive (a legitimate email routed to the spam folder) may cause you to miss an important message. In cancer screening, a false negative (missing a tumor) could be fatal. In criminal justice, errors affect human liberty. Different applications demand different tradeoffs between precision and recall—a theme we'll explore deeply.
Scientific and Industrial Applications
Beyond consumer-facing systems, classification enables scientific discovery and industrial processes:
| Domain | Problem | Positive Class | Impact |
|---|---|---|---|
| Particle physics | Event classification | Higgs boson event | Enabled discovery of Higgs boson |
| Drug discovery | Compound screening | Active compound | Dramatically accelerates early-stage candidate screening |
| Manufacturing | Defect detection | Defective part | Reduces waste and recalls |
| Astronomy | Object classification | Exoplanet signature | Discovering planets in telescope data |
| Climate science | Extreme event prediction | Hurricane/flood | Enables early warning systems |
Having studied regression extensively, you might wonder: "Why not just use regression for classification?" This natural question reveals important insights about why classification requires different approaches.
The Naive Approach: Linear Regression for Classification
Consider fitting a linear regression model $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ to binary labels and thresholding:
$$\hat{y} = \begin{cases} 1 & \text{if } f(\mathbf{x}) \geq 0.5 \\ 0 & \text{if } f(\mathbf{x}) < 0.5 \end{cases}$$
This seems reasonable but has fundamental problems:
- The output $f(\mathbf{x})$ is unbounded, so it cannot be interpreted as a probability; predictions can be negative or exceed 1.
- Squared loss keeps penalizing points that are already on the correct side of the threshold but far from their label value, so well-classified extreme examples still distort the fit.
- As a result, the decision boundary is highly sensitive to outliers, as the example below illustrates.
Consider a dataset with two classes cleanly separable at x=0. Add one extreme positive example at x=100. Linear regression tilts toward this outlier, shifting the decision boundary and misclassifying many points that were previously correct. Classification-specific methods are robust to this scenario.
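The following sketch reproduces that scenario numerically; the data and the helper `fit_linear_threshold` are illustrative. A single extreme but correctly labeled point drags the least-squares decision boundary away from zero and into the positive cluster:

```python
import numpy as np

def fit_linear_threshold(x: np.ndarray, y: np.ndarray) -> float:
    """Fit y ~ w*x + b by least squares and return the x where w*x + b = 0.5."""
    A = np.column_stack([x, np.ones_like(x)])
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return (0.5 - b) / w

# Cleanly separable 1D data: negatives left of 0, positives right of 0
x = np.concatenate([np.linspace(-2, -0.5, 20), np.linspace(0.5, 2, 20)])
y = np.concatenate([np.zeros(20), np.ones(20)])
print(f"Boundary without outlier: {fit_linear_threshold(x, y):+.3f}")   # ~0.0

# Add one extreme but correctly labeled positive example at x = 100.
# The boundary shifts well past x = 0.5, so previously correct positives are misclassified.
x_out = np.append(x, 100.0)
y_out = np.append(y, 1.0)
print(f"Boundary with outlier:    {fit_linear_threshold(x_out, y_out):+.3f}")
```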
What Classification Thinking Demands
Proper classification methods address these issues through key design choices:
- Outputs are constrained to valid probabilities or class scores rather than arbitrary real values.
- Loss functions target misclassification or likelihood (0-1 loss, log loss, hinge loss) instead of numerical distance to the label.
- The object of interest is the decision boundary between classes, not an exact fit to the label values.
The Sigmoid: Bridging Linear Models and Classification
The key insight enabling logistic regression is the sigmoid function (also called the logistic function):
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
This function transforms any real number to the interval $(0, 1)$:
- $\sigma(0) = 0.5$, and $\sigma$ is strictly increasing in $z$.
- It is symmetric in the sense that $\sigma(-z) = 1 - \sigma(z)$.
- Its derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, maximized at $z = 0$.
- It saturates toward 0 and 1 for large $|z|$, as demonstrated in the code below.
By composing a linear function with the sigmoid, we get:
$$P(Y = 1 | \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$
This is the logistic regression model—a proper probabilistic classifier that we'll study in depth in the next module.
```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """
    Sigmoid (logistic) function.

    Numerically stable implementation that avoids overflow.
    """
    # exp is only ever applied to a non-positive argument, so it cannot overflow
    ez = np.exp(-np.abs(z))
    return np.where(z >= 0, 1 / (1 + ez), ez / (1 + ez))


def sigmoid_derivative(z: np.ndarray) -> np.ndarray:
    """
    Derivative of the sigmoid function.

    d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    """
    s = sigmoid(z)
    return s * (1 - s)


def demonstrate_sigmoid_properties():
    """Demonstrate key properties of the sigmoid function."""
    print("SIGMOID FUNCTION PROPERTIES")
    print("=" * 50)

    # Evaluate at key points
    test_points = [-10, -5, -2, -1, 0, 1, 2, 5, 10]
    print("\nValue at key points:")
    for z in test_points:
        print(f"  σ({z:3d}) = {sigmoid(np.array([z]))[0]:.6f}")

    # Symmetry property: σ(-z) = 1 - σ(z)
    print("\nSymmetry property: σ(-z) = 1 - σ(z)")
    for z in [1, 2, 5]:
        s_pos = sigmoid(np.array([z]))[0]
        s_neg = sigmoid(np.array([-z]))[0]
        print(f"  σ({z}) = {s_pos:.4f}, σ({-z}) = {s_neg:.4f}, sum = {s_pos + s_neg:.4f}")

    # Derivative is maximized at z=0
    print("\nDerivative at key points (max at z=0):")
    for z in [-2, -1, 0, 1, 2]:
        deriv = sigmoid_derivative(np.array([z]))[0]
        print(f"  σ'({z:2d}) = {deriv:.4f}")

    # Practical thresholds
    print("\nPractical saturation thresholds:")
    for threshold in [0.01, 0.001, 0.0001]:
        z_neg = np.log(threshold / (1 - threshold))
        z_pos = -z_neg
        print(f"  σ(z) ∈ [{threshold}, {1-threshold}] for z ∈ [{z_neg:.2f}, {z_pos:.2f}]")


if __name__ == "__main__":
    demonstrate_sigmoid_properties()
```

With the problem formulated, let's establish the complete learning setup that applies to all binary classification methods.
The Training Objective
Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, we seek parameters $\boldsymbol{\theta}$ that minimize an empirical risk:
$$\hat{R}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n L(y_i, f_{\boldsymbol{\theta}}(\mathbf{x}_i))$$
where $L$ is a loss function measuring prediction quality. The choice of loss function profoundly affects the resulting classifier.
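A minimal sketch of the empirical risk under two common losses (the labels and predicted probabilities below are made up for illustration):

```python
import numpy as np

def zero_one_loss(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return (y != y_hat).astype(float)

def log_loss(y: np.ndarray, p_hat: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Log loss (negative log-likelihood) for predicted probabilities P(Y=1|x)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def empirical_risk(losses: np.ndarray) -> float:
    """Average per-example loss over the training sample: (1/n) * sum of losses."""
    return float(np.mean(losses))

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.3, 0.1])
print(empirical_risk(zero_one_loss(y, (p > 0.5).astype(int))))  # misclassification rate: 0.2
print(empirical_risk(log_loss(y, p)))                           # average log loss
```

The same data yields different risks under different losses, which is exactly why the choice of loss shapes the resulting classifier.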
The Generalization Goal
Minimizing training loss is insufficient—we care about generalization to unseen data. The true goal is minimizing the expected risk:
$$R(\boldsymbol{\theta}) = \mathbb{E}_{(\mathbf{X}, Y) \sim P}[L(Y, f_{\boldsymbol{\theta}}(\mathbf{X}))]$$
The gap between training risk and true risk is the generalization gap. Controlling this gap is fundamental to machine learning theory.
Most learning algorithms follow the ERM principle: minimize average loss on training data. This is justified when training data is representative of the distribution we want to generalize to. When this assumption fails (distribution shift, small samples, adversarial settings), ERM must be augmented with regularization, robust optimization, or domain adaptation techniques.
Train/Validation/Test Split
To estimate generalization without accessing the true distribution, we partition data:
| Set | Purpose | Typical Size | Access During Training |
|---|---|---|---|
| Training set | Fit model parameters | 60-80% | Unlimited |
| Validation set | Tune hyperparameters, select model | 10-20% | For selection only |
| Test set | Final evaluation | 10-20% | Single final evaluation |
Critical principle: The test set must remain untouched until final evaluation. Using it to guide model selection leads to overfitting to the test set, and performance estimates become optimistic.
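A minimal splitting sketch follows; it is unstratified for brevity, whereas in practice you would usually stratify so that each split preserves the class ratio (for example with scikit-learn's `train_test_split` and its `stratify` argument). The helper name and fractions are illustrative:

```python
import numpy as np

def train_val_test_split(X: np.ndarray, y: np.ndarray,
                         val_frac: float = 0.2, test_frac: float = 0.2,
                         seed: int = 0):
    """Randomly partition a dataset into train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Example with synthetic data
X = np.random.randn(1000, 5)
y = (np.random.rand(1000) < 0.3).astype(int)
train, val, test = train_val_test_split(X, y)
print(len(train[1]), len(val[1]), len(test[1]))  # 600 200 200
```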
Cross-Validation for Small Datasets
When data is limited, cross-validation provides better estimates: in k-fold cross-validation, the data is split into k folds, each fold serves once as the validation set while the model is trained on the remaining k-1 folds, and the k validation scores are averaged. Every example is used for both training and validation, at the cost of fitting k models.
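A bare-bones sketch of k-fold index generation (helper names are ours; scikit-learn's `KFold` and `StratifiedKFold` provide production-ready versions):

```python
import numpy as np

def k_fold_indices(n: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Example: 5-fold estimate of a trivial majority-class baseline's accuracy
y = (np.random.default_rng(1).random(100) < 0.3).astype(int)
scores = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    majority = int(y[train_idx].mean() > 0.5)              # "train" the baseline
    scores.append(float(np.mean(y[val_idx] == majority)))  # validate on the held-out fold
print(f"CV accuracy: {np.mean(scores):.3f}")
```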
We've established the rigorous foundation for binary classification. Let's consolidate the key concepts:
- Binary classification learns a function $f: \mathbb{R}^d \rightarrow \{0, 1\}$ from i.i.d. samples of an unknown joint distribution $P(\mathbf{X}, Y)$.
- The Bayes optimal classifier thresholds $P(Y=1|\mathbf{x})$ at 0.5, and the Bayes error rate is the irreducible floor on performance.
- Labels may be encoded as $\{0, 1\}$ or $\{-1, +1\}$; the positive class is the class of primary interest.
- Class imbalance is common in practice and makes raw accuracy misleading.
- Thresholded linear regression is a poor classifier; composing a linear score with the sigmoid yields a valid probability and leads to logistic regression.
- Learning minimizes empirical risk, but the real target is expected risk on unseen data, estimated via train/validation/test splits or cross-validation.
What's Next
With the problem formulated, we turn to the geometric view: decision boundaries. In the next page, we'll explore how classifiers partition feature space into regions assigned to each class, visualize these boundaries for different hypothesis classes, and understand the tradeoff between boundary flexibility and generalization.
You now have the mathematical framework and intuition for binary classification. This foundation supports everything that follows: decision boundaries, probabilistic interpretation, linear classifiers, loss functions, and ultimately logistic regression. Each subsequent page builds on these concepts.