Every classifier must ultimately make a decision: class 0 or class 1? In logistic regression, this decision crystallizes at the decision boundary—a geometric surface in feature space that separates regions predicted as one class from regions predicted as the other.
Understanding decision boundaries transforms logistic regression from an abstract probability machine into a geometric concept you can visualize and reason about. You'll see exactly where the model is confident, where it's uncertain, and why certain points get classified as they do.
This geometric perspective is not just intellectually satisfying—it's practically essential for debugging models, understanding their limitations, and knowing when to choose more complex alternatives.
By the end of this page, you will understand: (1) the mathematical definition of the decision boundary, (2) how to visualize boundaries in 2D and understand them in higher dimensions, (3) the relationship between distance from boundary and prediction confidence, (4) the concept of margin and its connection to generalization, and (5) the inherent limitations of linear decision boundaries.
The decision boundary is the set of all points in feature space where the classifier is exactly undecided—where $P(Y=1|\mathbf{x}) = P(Y=0|\mathbf{x}) = 0.5$.
Derivation
Since $P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)$ and $\sigma(0) = 0.5$, the decision boundary occurs when:
$$\mathbf{w}^T\mathbf{x} + b = 0$$
Expanding in terms of individual features:
$$w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b = 0$$
This is the equation of a hyperplane in $\mathbb{R}^d$.
Hyperplane Terminology
Key Properties
| Dimensions (d) | Boundary Type | Equation | Example |
|---|---|---|---|
| 1 | Point | w₁x₁ + b = 0 → x₁ = -b/w₁ | Age threshold for approval |
| 2 | Line | w₁x₁ + w₂x₂ + b = 0 | Height-weight classification |
| 3 | Plane | w₁x₁ + w₂x₂ + w₃x₃ + b = 0 | RGB color classification |
| d | (d-1)-hyperplane | w^T x + b = 0 | High-dimensional text classification |
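The 1D row of the table can be checked concretely. A minimal sketch with invented numbers (the weight 0.15 and bias -6.0 are hypothetical, not from any fitted model):

```python
# Hypothetical 1D model: z = w1 * age + b (numbers invented for illustration)
w1, b = 0.15, -6.0

# In one dimension the "hyperplane" is a single point: w1 * x1 + b = 0
threshold = -b / w1  # ≈ 40.0

def predict(age):
    return 1 if w1 * age + b > 0 else 0

assert predict(threshold + 1) == 1  # above the threshold → class 1
assert predict(threshold - 1) == 0  # below the threshold → class 0
```

With these numbers, applicants older than 40 fall on the positive side of the boundary and younger ones on the negative side, exactly as the "age threshold" example in the table suggests.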
Prediction Rule Based on Boundary
The classification rule becomes simple:
$$\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} + b > 0 \\ 0 & \text{if } \mathbf{w}^T\mathbf{x} + b < 0 \end{cases}$$
(Points exactly on the boundary are typically assigned to class 1 by convention.)
The side of the boundary determines the class. The distance from the boundary determines the confidence.
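The rule is a one-liner in code. A minimal sketch with hypothetical weights (any fitted `w` and `b` would work the same way):

```python
import numpy as np

w = np.array([2.0, -1.0])  # hypothetical weights
b = 0.5

def predict(x, w, b):
    """Classify by which side of the hyperplane x falls on.
    Points exactly on the boundary (z == 0) go to class 1 by convention."""
    z = w @ x + b
    return int(z >= 0)

print(predict(np.array([1.0, 1.0]), w, b))   # z = 1.5 → 1
print(predict(np.array([-1.0, 1.0]), w, b))  # z = -2.5 → 0
print(predict(np.array([0.0, 0.5]), w, b))   # z = 0, on the boundary → 1
```

Note that only the sign of $z$ matters for the class; its magnitude is what carries the confidence, as the next sections develop.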
The linear boundary is a direct consequence of using a linear combination (w^T x + b) as input to the sigmoid. To get nonlinear boundaries, we must either (1) engineer nonlinear features, (2) use kernel methods, or (3) use inherently nonlinear models like neural networks or decision trees.
Two-dimensional feature spaces allow us to see decision boundaries directly. This visualization builds intuition that transfers to higher dimensions.
Anatomy of a 2D Decision Boundary
For features $(x_1, x_2)$ with model parameters $(w_1, w_2, b)$:
$$w_1 x_1 + w_2 x_2 + b = 0$$
Rearranging to slope-intercept form:
$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$$
This is a line with slope $-w_1/w_2$ and $x_2$-intercept $-b/w_2$ (assuming $w_2 \neq 0$; if $w_2 = 0$, the boundary is the vertical line $x_1 = -b/w_1$).
The Weight Vector as Normal
The weight vector $\mathbf{w} = (w_1, w_2)^T$ is always perpendicular to the decision line. To see this, note that for any two points $\mathbf{x}_a$ and $\mathbf{x}_b$ on the boundary:
$$\mathbf{w}^T\mathbf{x}_a = -b = \mathbf{w}^T\mathbf{x}_b$$
Therefore: $$\mathbf{w}^T(\mathbf{x}_a - \mathbf{x}_b) = 0$$
The vector $\mathbf{w}$ is orthogonal to any vector lying in the boundary—exactly what 'normal' means.
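This orthogonality is easy to verify numerically. A small sketch with invented parameters: construct two points on the boundary by solving for $x_2$, then check that $\mathbf{w}$ is perpendicular to the direction between them:

```python
import numpy as np

w = np.array([3.0, 4.0])  # hypothetical parameters
b = -2.0

def on_boundary(x1):
    """A point on the boundary: solve w1*x1 + w2*x2 + b = 0 for x2."""
    return np.array([x1, -(w[0] * x1 + b) / w[1]])

xa, xb = on_boundary(0.0), on_boundary(5.0)
assert abs(w @ xa + b) < 1e-12 and abs(w @ xb + b) < 1e-12  # both on boundary

# The direction along the boundary is orthogonal to w
direction = xa - xb
print(w @ direction)  # → 0.0 (up to floating point)
```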
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate 2D data
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=1.5, random_state=42)

# Fit logistic regression
model = LogisticRegression()
model.fit(X, y)

# Extract parameters
w = model.coef_[0]
b = model.intercept_[0]

print("Decision Boundary Analysis")
print("=" * 50)
print(f"Weight vector w: [{w[0]:.4f}, {w[1]:.4f}]")
print(f"Bias b: {b:.4f}")
print(f"Decision boundary equation: {w[0]:.3f}*x₁ + {w[1]:.3f}*x₂ + {b:.3f} = 0")
print(f"Slope: {-w[0]/w[1]:.4f}")
print(f"y-intercept: {-b/w[1]:.4f}")

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Decision boundary with data
ax1 = axes[0]
x1_range = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200)
x2_range = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200)
X1, X2 = np.meshgrid(x1_range, x2_range)
Z = model.predict_proba(np.c_[X1.ravel(), X2.ravel()])[:, 1].reshape(X1.shape)

# Probability contours
contour = ax1.contourf(X1, X2, Z, levels=20, cmap='RdBu_r', alpha=0.6)
plt.colorbar(contour, ax=ax1, label='P(Y=1)')

# Decision boundary (P = 0.5, i.e. z = 0)
ax1.contour(X1, X2, Z, levels=[0.5], colors='black', linewidths=2)

# Data points
ax1.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', edgecolors='k', s=50)
ax1.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', edgecolors='k', s=50)

# Weight vector (scaled for visibility)
center = np.array([0, -b/w[1]])  # Point on boundary
scale = 1.5
ax1.quiver(center[0], center[1], w[0]*scale, w[1]*scale, color='green',
           scale=5, width=0.015, label='w (normal)', zorder=10)

ax1.set_xlabel('x₁')
ax1.set_ylabel('x₂')
ax1.set_title('Decision Boundary with Probability Contours')
ax1.legend(loc='upper left')

# Plot 2: Confidence bands
ax2 = axes[1]
ax2.contourf(X1, X2, Z, levels=[0, 0.25, 0.5, 0.75, 1.0],
             colors=['#2166ac', '#92c5de', '#f4a582', '#b2182b'], alpha=0.7)
ax2.contour(X1, X2, Z, levels=[0.25, 0.5, 0.75], colors='black',
            linewidths=1, linestyles=['--', '-', '--'])
ax2.scatter(X[y==0, 0], X[y==0, 1], c='blue', edgecolors='k', s=30)
ax2.scatter(X[y==1, 0], X[y==1, 1], c='red', edgecolors='k', s=30)
ax2.set_xlabel('x₁')
ax2.set_ylabel('x₂')
ax2.set_title('Confidence Bands (P<0.25, 0.25-0.5, 0.5-0.75, P>0.75)')

plt.tight_layout()
plt.savefig('decision_boundary_2d.png', dpi=150)
plt.show()
```

The probability contours (iso-probability lines) are parallel to the decision boundary. Moving perpendicular to the boundary (in the $\mathbf{w}$ direction) changes probability most rapidly; moving parallel to the boundary doesn't change probability at all. The contours are equally spaced in log-odds, but compressed near 0 and 1 in probability.
The distance from a point to the decision boundary directly determines how confident the model is in its prediction. This relationship is fundamental to understanding logistic regression's behavior.
Signed Distance to the Boundary
For a point $\mathbf{x}$, the signed distance to the decision boundary is:
$$d(\mathbf{x}) = \frac{\mathbf{w}^T\mathbf{x} + b}{|\mathbf{w}|}$$
where $|\mathbf{w}| = \sqrt{w_1^2 + w_2^2 + \cdots + w_d^2}$ is the Euclidean norm of the weight vector.
Properties of the Signed Distance
Sign: $d(\mathbf{x}) > 0$ on the class-1 side of the boundary, $d(\mathbf{x}) < 0$ on the class-0 side, and $d(\mathbf{x}) = 0$ exactly on the boundary.
Magnitude: $|d(\mathbf{x})|$ is the Euclidean distance from $\mathbf{x}$ to the hyperplane.
Scale invariance: multiplying $(\mathbf{w}, b)$ by a positive constant leaves $d(\mathbf{x})$ unchanged, even though it changes the predicted probabilities.
From Distance to Probability
The probability prediction depends on $z = \mathbf{w}^T\mathbf{x} + b = |\mathbf{w}| \cdot d$. Thus:
$$P(Y=1|\mathbf{x}) = \sigma(|\mathbf{w}| \cdot d)$$
Larger $|\mathbf{w}|$ means the same distance produces more confident predictions. This is why regularization (which shrinks $|\mathbf{w}|$) produces gentler, less overconfident probability estimates.
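This scaling effect can be verified directly. A minimal sketch with invented parameters: scaling $(\mathbf{w}, b)$ by a constant $c > 0$ leaves the boundary, and hence every signed distance, unchanged, but sharpens the probabilities:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical model parameters and a test point
w0, b0 = np.array([2.0, 1.0]), 0.3
x = np.array([1.5, -0.5])

for c in [0.5, 1.0, 4.0]:
    w, b = c * w0, c * b0
    d = (w @ x + b) / np.linalg.norm(w)  # same for every c
    p = sigmoid(np.linalg.norm(w) * d)   # identical to sigmoid(w @ x + b)
    print(f"c={c:.1f}: distance={d:.3f}, P(Y=1)={p:.3f}")
```

The printed distance is constant across `c`, while the probability moves toward 1 as `c` grows: same geometry, more confident model.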
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Model parameters
w = np.array([2.0, 1.0])
b = -1.0
w_norm = np.linalg.norm(w)

def signed_distance(x, w, b):
    """Compute signed distance from point to decision boundary."""
    return (w @ x + b) / np.linalg.norm(w)

def probability_from_distance(d, w_norm):
    """Convert signed distance to probability."""
    z = w_norm * d
    return sigmoid(z)

# Analyze relationship between distance and probability
distances = np.linspace(-3, 3, 100)
probabilities = probability_from_distance(distances, w_norm)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Distance vs Probability
ax1 = axes[0]
ax1.plot(distances, probabilities, 'b-', linewidth=2, label=f'||w|| = {w_norm:.2f}')

# Show effect of different ||w||
for w_scale in [0.5, 1.0, 2.0]:
    probs = probability_from_distance(distances, w_norm * w_scale)
    ax1.plot(distances, probs, '--', alpha=0.7, label=f'||w|| = {w_norm * w_scale:.2f}')

ax1.axhline(y=0.5, color='gray', linestyle=':', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax1.fill_between([-3, 0], 0, 1, alpha=0.1, color='blue', label='Class 0 side')
ax1.fill_between([0, 3], 0, 1, alpha=0.1, color='red', label='Class 1 side')
ax1.set_xlabel('Signed Distance from Boundary')
ax1.set_ylabel('P(Y=1)')
ax1.set_title('Distance to Boundary vs Prediction Confidence')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 1)

# Right: Sample points with their distances and probabilities
ax2 = axes[1]

test_points = np.array([
    [0, 0.5],    # Near boundary
    [1, 0],      # On class 1 side
    [-1, 0],     # On class 0 side
    [2, 1],      # Far on class 1 side
    [-2, -1],    # Far on class 0 side
])

# Plot decision boundary
x1_line = np.linspace(-3, 3, 100)
x2_line = -(w[0] * x1_line + b) / w[1]
ax2.plot(x1_line, x2_line, 'k-', linewidth=2, label='Boundary')

# Plot points, colored by predicted probability
for pt in test_points:
    d = signed_distance(pt, w, b)
    p = sigmoid(w @ pt + b)
    color = plt.cm.RdBu_r(p)
    ax2.scatter(pt[0], pt[1], c=[color], s=150, edgecolors='k', zorder=5)
    ax2.annotate(f'd={d:.2f}\nP={p:.2f}', (pt[0]+0.1, pt[1]+0.15), fontsize=9)

# Draw normal vector
origin = np.array([0, -b/w[1]])
w_normalized = w / w_norm
ax2.quiver(origin[0], origin[1], w_normalized[0]*1.5, w_normalized[1]*1.5,
           color='green', scale=5, width=0.02, label='Normal (w/||w||)')

ax2.set_xlim(-3, 3)
ax2.set_ylim(-3, 3)
ax2.set_xlabel('x₁')
ax2.set_ylabel('x₂')
ax2.set_title('Sample Points: Distance and Probability')
ax2.legend(loc='upper left')
ax2.set_aspect('equal')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('distance_confidence.png', dpi=150)
plt.show()

# Print summary
print("Distance-Probability Relationship")
print("=" * 50)
print(f"Weight vector: {w}")
print(f"||w||: {w_norm:.4f}")
print("For various distances:")
for d in [-2, -1, -0.5, 0, 0.5, 1, 2]:
    p = probability_from_distance(d, w_norm)
    print(f"  Distance {d:>5.1f} → Probability {p:.4f}")
```

The magnitude ||w|| controls how 'sharp' the probability transition is across the boundary. Large ||w|| → sharp transition (confident predictions close to the boundary). Small ||w|| → gradual transition (uncertain predictions even far from the boundary). Regularization reduces ||w||, producing softer, more calibrated probability estimates.
The margin is a fundamental concept linking logistic regression to Support Vector Machines (SVMs) and broader notions of classifier robustness.
Functional Margin
The functional margin of a point $(\mathbf{x}_i, y_i)$ is:
$$\gamma_i^{(f)} = y_i^* \cdot (\mathbf{w}^T\mathbf{x}_i + b)$$
where $y_i^* = 2y_i - 1 \in \{-1, +1\}$ converts labels to $\pm 1$.
A positive functional margin means correct classification; negative means incorrect.
Geometric Margin
The geometric margin is the functional margin normalized by $|\mathbf{w}|$:
$$\gamma_i^{(g)} = \frac{y_i^* \cdot (\mathbf{w}^T\mathbf{x}_i + b)}{|\mathbf{w}|}$$
This equals the signed distance from $\mathbf{x}_i$ to the boundary (with sign indicating correctness).
Minimum Margin
The minimum margin of a classifier is the smallest margin among all training points:
$$\gamma_{\min} = \min_i \gamma_i^{(g)}$$
This represents the 'safety buffer' of the classifier—how close the nearest correctly classified point is to the boundary.
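The three margin definitions above can be computed in a few lines. A minimal numeric sketch with an invented classifier and four invented training points:

```python
import numpy as np

w = np.array([1.0, 1.0])  # hypothetical classifier
b = -1.0

X = np.array([[2.0, 1.0], [0.5, 0.2], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 0, 0, 0])
y_star = 2 * y - 1  # labels in {-1, +1}

functional = y_star * (X @ w + b)            # functional margins
geometric = functional / np.linalg.norm(w)   # geometric margins

print(functional)         # all positive → every point correctly classified
print(geometric.min())    # the minimum margin (the 'safety buffer')
```

Here the second point sits closest to the boundary, so its geometric margin is the minimum.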
Logistic Regression vs. Maximum Margin (SVM)
Logistic regression maximizes likelihood (equivalently, minimizes log-loss), which implicitly considers all points. SVMs explicitly maximize the minimum margin, focusing only on the nearest points (support vectors).
With L2 regularization, logistic regression tends toward larger margins, but it's a softer constraint than SVM's hard margin maximization. This leads to differences:
| Aspect | Logistic Regression | SVM |
|---|---|---|
| Objective | Maximize likelihood | Maximize margin |
| All points matter | Yes (weighted by error) | Only support vectors |
| Probabilistic output | Yes (calibrated) | No (requires calibration) |
| Margin enforcement | Soft (via regularization) | Hard (primal constraint) |
L2 regularization is closely related to margin maximization: minimizing ||w||² while keeping the training points correctly classified (functional margins at least 1) is precisely the SVM's margin-maximization objective. This is why well-regularized logistic regression and a linear SVM often produce similar decision boundaries, even though the two objectives are not strictly equivalent.
While 2D visualizations build intuition, real problems typically have many more dimensions. The key insights transfer directly, though visualization becomes impossible.
The Hyperplane in $\mathbb{R}^d$
The decision boundary $\mathbf{w}^T\mathbf{x} + b = 0$ defines a $(d-1)$-dimensional hyperplane: a 9-dimensional "slice" of a 10-dimensional feature space, a 999-dimensional one in 1000 dimensions, and so on.
Though we can't visualize these, their properties—linearity, normal vector $\mathbf{w}$, distance formulas—all hold.
Understanding Without Visualizing
Even when visualization fails, we can understand the boundary through:
Feature importance: Which $w_j$ are large tells us which features most influence the boundary orientation
Distance distributions: Plotting the distribution of distances from boundary for each class reveals how well-separated they are
2D projections: Projecting data onto the $\mathbf{w}$ direction (and one orthogonal direction) shows how the boundary separates classes
Probability distributions: Examining predicted probability distributions for each class shows separation quality
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate high-dimensional data
np.random.seed(42)
n_samples = 500
n_features = 50
X, y = make_classification(n_samples=n_samples, n_features=n_features,
                           n_informative=10, n_redundant=5,
                           class_sep=1.0, random_state=42)

# Fit model
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X, y)

w = model.coef_[0]
b = model.intercept_[0]
w_norm = np.linalg.norm(w)

# Compute distances from boundary for all points
z = X @ w + b
distances = z / w_norm
probabilities = 1 / (1 + np.exp(-z))

# Analysis
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Feature importance (|w|)
ax1 = axes[0, 0]
importance_order = np.argsort(np.abs(w))[::-1]
top_k = 20
ax1.barh(range(top_k), np.abs(w[importance_order[:top_k]])[::-1])
ax1.set_yticks(range(top_k))
ax1.set_yticklabels([f'Feature {importance_order[top_k-1-i]}' for i in range(top_k)])
ax1.set_xlabel('|Coefficient|')
ax1.set_title(f'Top {top_k} Feature Importances (of {n_features})')

# Plot 2: Distance distributions by class
ax2 = axes[0, 1]
ax2.hist(distances[y==0], bins=30, alpha=0.7, label='Class 0', color='blue')
ax2.hist(distances[y==1], bins=30, alpha=0.7, label='Class 1', color='red')
ax2.axvline(x=0, color='black', linestyle='--', linewidth=2, label='Boundary')
ax2.set_xlabel('Signed Distance from Boundary')
ax2.set_ylabel('Count')
ax2.set_title('Distance Distribution by Class')
ax2.legend()

# Plot 3: 1D projection onto w
ax3 = axes[1, 0]
projection = X @ w / w_norm  # Project onto unit normal
ax3.scatter(projection[y==0], np.zeros(sum(y==0)) + np.random.randn(sum(y==0))*0.1,
            alpha=0.5, c='blue', label='Class 0', s=20)
ax3.scatter(projection[y==1], np.zeros(sum(y==1)) + np.random.randn(sum(y==1))*0.1 + 1,
            alpha=0.5, c='red', label='Class 1', s=20)
ax3.axvline(x=-b/w_norm, color='black', linestyle='--', linewidth=2)
ax3.set_xlabel('Projection onto w Direction')
ax3.set_ylabel('Class (with jitter)')
ax3.set_title('Data Projected onto Normal Direction')
ax3.legend()

# Plot 4: Probability distributions
ax4 = axes[1, 1]
ax4.hist(probabilities[y==0], bins=30, alpha=0.7, label='Class 0', color='blue')
ax4.hist(probabilities[y==1], bins=30, alpha=0.7, label='Class 1', color='red')
ax4.axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='P=0.5')
ax4.set_xlabel('Predicted Probability P(Y=1)')
ax4.set_ylabel('Count')
ax4.set_title('Probability Distribution by Class')
ax4.legend()

plt.tight_layout()
plt.savefig('high_dimensional_analysis.png', dpi=150)
plt.show()

# Summary statistics
print(f"High-Dimensional Boundary Analysis (d={n_features})")
print("=" * 60)
print(f"||w||: {w_norm:.4f}")
print(f"Number of features with |w| > 0.1: {sum(np.abs(w) > 0.1)}")
print("Class separation:")
print(f"  Mean distance (Class 0): {distances[y==0].mean():.4f}")
print(f"  Mean distance (Class 1): {distances[y==1].mean():.4f}")
print(f"  Minimum margin: {min(distances[y==1].min(), -distances[y==0].max()):.4f}")
```

In very high dimensions, interesting phenomena occur. Most points lie near the surface of the hypercube, and distances between random points concentrate around the mean. These effects make high-dimensional classification both easier (more room to separate) and harder (fewer points per region). Understanding the boundary analytically becomes essential when visualization fails.
The linearity of logistic regression's decision boundary is both a strength (interpretability, stability) and a limitation (inability to capture complex patterns). Understanding when linearity fails is crucial for model selection.
The XOR Problem
The classic example is the XOR (exclusive or) pattern:
| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single line can separate the two classes. This is the archetypal linearly non-separable dataset.
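This can be confirmed empirically, and it also previews the fix discussed below: adding the interaction feature $x_1 x_2$ makes XOR linearly separable in the lifted space. A small sketch using sklearn (the large `C` just weakens regularization so the separable fit can be found):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

# Linear features: no hyperplane separates XOR
linear = LogisticRegression().fit(X, y)
print(linear.score(X, y))  # at most 0.75, never 1.0

# Adding the interaction x1*x2 makes the classes separable
# (e.g. x1 + x2 - 2*x1*x2 - 0.5 has the right sign on all four points)
X_int = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
interact = LogisticRegression(C=1e6, max_iter=1000).fit(X_int, y)
print(interact.score(X_int, y))  # → 1.0
```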
Types of Nonlinear Patterns
Common patterns a linear boundary cannot capture include XOR/checkerboard arrangements, concentric rings (one class surrounding another), multiple disjoint clusters of the same class, and smoothly curved class boundaries.
Recognizing Linear Inseparability
Signs that a linear boundary may be insufficient: training accuracy stuck near the class base rate even after tuning; errors concentrated in particular regions of feature space; 2D projections showing interleaved or nested classes; domain knowledge suggesting strong interactions between features.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_circles

# Generate XOR-like data
np.random.seed(42)
n = 200
X_xor = np.vstack([
    np.random.randn(n//4, 2) + [1, 1],
    np.random.randn(n//4, 2) + [-1, -1],
    np.random.randn(n//4, 2) + [1, -1],
    np.random.randn(n//4, 2) + [-1, 1],
])
# One label per cluster block (np.repeat matches the vstack order above)
y_xor = np.repeat([0, 0, 1, 1], n // 4)

# Generate concentric circles
X_circles, y_circles = make_circles(n_samples=n, noise=0.1, factor=0.5,
                                    random_state=42)

# Create figure
fig, axes = plt.subplots(2, 3, figsize=(14, 9))

datasets = [
    (X_xor, y_xor, 'XOR Pattern'),
    (X_circles, y_circles, 'Concentric Circles'),
]

for row, (X, y, title) in enumerate(datasets):
    # Original with linear boundary
    ax = axes[row, 0]
    model_linear = LogisticRegression()
    model_linear.fit(X, y)
    acc_linear = model_linear.score(X, y)

    x1r = np.linspace(X[:,0].min()-0.5, X[:,0].max()+0.5, 100)
    x2r = np.linspace(X[:,1].min()-0.5, X[:,1].max()+0.5, 100)
    X1, X2 = np.meshgrid(x1r, x2r)
    Z = model_linear.predict(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape)

    ax.contourf(X1, X2, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X[y==0,0], X[y==0,1], c='blue', s=20, edgecolors='k')
    ax.scatter(X[y==1,0], X[y==1,1], c='red', s=20, edgecolors='k')
    ax.set_title(f'{title}\nLinear: {acc_linear:.1%}')
    ax.set_aspect('equal')

    # With polynomial features (degree 2)
    ax = axes[row, 1]
    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)
    model_poly2 = LogisticRegression(max_iter=1000)
    model_poly2.fit(X_poly, y)
    acc_poly2 = model_poly2.score(X_poly, y)

    Z_poly = model_poly2.predict(poly.transform(np.c_[X1.ravel(), X2.ravel()])).reshape(X1.shape)
    ax.contourf(X1, X2, Z_poly, alpha=0.3, cmap='RdBu')
    ax.scatter(X[y==0,0], X[y==0,1], c='blue', s=20, edgecolors='k')
    ax.scatter(X[y==1,0], X[y==1,1], c='red', s=20, edgecolors='k')
    ax.set_title(f'Poly Degree 2: {acc_poly2:.1%}')
    ax.set_aspect('equal')

    # With polynomial features (degree 3)
    ax = axes[row, 2]
    poly3 = PolynomialFeatures(degree=3)
    X_poly3 = poly3.fit_transform(X)
    model_poly3 = LogisticRegression(max_iter=1000, C=10)
    model_poly3.fit(X_poly3, y)
    acc_poly3 = model_poly3.score(X_poly3, y)

    Z_poly3 = model_poly3.predict(poly3.transform(np.c_[X1.ravel(), X2.ravel()])).reshape(X1.shape)
    ax.contourf(X1, X2, Z_poly3, alpha=0.3, cmap='RdBu')
    ax.scatter(X[y==0,0], X[y==0,1], c='blue', s=20, edgecolors='k')
    ax.scatter(X[y==1,0], X[y==1,1], c='red', s=20, edgecolors='k')
    ax.set_title(f'Poly Degree 3: {acc_poly3:.1%}')
    ax.set_aspect('equal')

plt.suptitle('Linear Limitations and Polynomial Feature Solutions', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('linear_limitations.png', dpi=150)
plt.show()
```

Polynomial features, interaction terms, and domain-specific transformations can make logistic regression capture nonlinear patterns. But this increases dimensionality rapidly (degree-d polynomials in p features → O(p^d) features), risking overfitting and computational explosion. For complex nonlinear patterns, consider kernel methods, neural networks, or tree-based models.
By default, we classify as class 1 when $P(Y=1|\mathbf{x}) > 0.5$. But this threshold can be adjusted to balance different types of errors.
The Standard Threshold
With threshold $\tau = 0.5$: $$\hat{y} = \begin{cases} 1 & \text{if } P(Y=1|\mathbf{x}) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
This corresponds to classifying based on which class has higher probability—the Bayes optimal rule when costs are equal.
Adjusting the Threshold
With threshold $\tau \neq 0.5$: $$\hat{y} = \begin{cases} 1 & \text{if } P(Y=1|\mathbf{x}) > \tau \\ 0 & \text{otherwise} \end{cases}$$
Effect on Decision Boundary
Changing $\tau$ shifts the effective decision boundary: raising $\tau$ above 0.5 moves the boundary into the class-1 region (fewer, more confident positive predictions), while lowering it expands the class-1 region (more positives, higher recall). The shifted boundary remains a hyperplane parallel to the original.
Mathematically, predicting class 1 when $\sigma(z) > \tau$ is equivalent to: $$z > \sigma^{-1}(\tau) = \log\left(\frac{\tau}{1-\tau}\right)$$
The new boundary is $z = \sigma^{-1}(\tau)$ instead of $z = 0$.
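The equivalence rests on $\sigma$ and the logit being inverses. A quick numeric check (the `logit` helper is just a local name for $\sigma^{-1}$):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(tau):
    """Inverse sigmoid: the log-odds corresponding to probability tau."""
    return np.log(tau / (1 - tau))

# sigma and logit are inverses, so thresholding P(Y=1) at tau
# is the same as thresholding z at logit(tau)
for tau in [0.1, 0.3, 0.5, 0.7, 0.9]:
    z_thresh = logit(tau)
    assert abs(sigmoid(z_thresh) - tau) < 1e-12
    print(f"tau={tau}: z threshold = {z_thresh:.4f}")

z = np.linspace(-4, 4, 101)
assert np.array_equal(sigmoid(z) > 0.7, z > logit(0.7))
```

At $\tau = 0.5$ the z-threshold is $\log(1) = 0$, recovering the standard boundary.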
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Generate imbalanced data
np.random.seed(42)
n = 1000
X = np.random.randn(n, 2)
# Imbalanced: roughly 20% class 1
z = 0.8 * X[:, 0] - 0.5 * X[:, 1] - 1.5  # Shifted to create imbalance
y = (1 / (1 + np.exp(-z)) > np.random.rand(n)).astype(int)

print(f"Class distribution: Class 0: {sum(y==0)}, Class 1: {sum(y==1)}")

# Fit model
model = LogisticRegression()
model.fit(X, y)

probabilities = model.predict_proba(X)[:, 1]

# Evaluate at different thresholds
thresholds = [0.3, 0.5, 0.7]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

print("Metrics at Different Thresholds")
print("=" * 70)
print(f"{'Threshold':<12} | {'Precision':>10} | {'Recall':>10} | {'F1':>10} | {'TP':>6} | {'FP':>6}")
print("-" * 70)

for i, tau in enumerate(thresholds):
    predictions = (probabilities > tau).astype(int)
    precision = precision_score(y, predictions, zero_division=0)
    recall = recall_score(y, predictions, zero_division=0)
    f1 = f1_score(y, predictions, zero_division=0)
    cm = confusion_matrix(y, predictions)

    print(f"{tau:<12} | {precision:>10.4f} | {recall:>10.4f} | {f1:>10.4f} | {cm[1,1]:>6} | {cm[0,1]:>6}")

    # Visualize
    ax = axes[i]
    x1r = np.linspace(X[:,0].min()-1, X[:,0].max()+1, 100)
    x2r = np.linspace(X[:,1].min()-1, X[:,1].max()+1, 100)
    X1, X2 = np.meshgrid(x1r, x2r)
    Z = model.predict_proba(np.c_[X1.ravel(), X2.ravel()])[:, 1].reshape(X1.shape)

    ax.contourf(X1, X2, Z > tau, alpha=0.3, cmap='RdBu')
    ax.contour(X1, X2, Z, levels=[tau], colors='black', linewidths=2)
    ax.scatter(X[y==0,0], X[y==0,1], c='blue', s=15, alpha=0.6, label='Class 0')
    ax.scatter(X[y==1,0], X[y==1,1], c='red', s=15, alpha=0.6, label='Class 1')
    ax.set_title(f'Threshold τ = {tau}\nPrecision={precision:.2f}, Recall={recall:.2f}')
    ax.set_xlabel('x₁')
    ax.set_ylabel('x₂')
    if i == 0:
        ax.legend(loc='upper left')

plt.tight_layout()
plt.savefig('threshold_adjustment.png', dpi=150)
plt.show()

# Log-odds of different thresholds
print("Log-odds (z) at different thresholds:")
for tau in [0.1, 0.3, 0.5, 0.7, 0.9]:
    z_thresh = np.log(tau / (1 - tau))
    print(f"  τ = {tau}: z = {z_thresh:.4f}")
```

Adjust the threshold when: (1) classes are imbalanced and the minority class matters more, (2) false positives and false negatives have different costs (e.g., medical diagnosis), or (3) you need to meet a specific precision or recall target. Use ROC curves and precision-recall curves to select the optimal threshold for your use case.
We've explored the decision boundary from multiple perspectives—mathematical, geometric, and practical. This geometric understanding is essential for reasoning about classifier behavior and limitations. The key insights: the boundary is the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$; $\mathbf{w}$ is its normal vector; signed distance from the boundary, scaled by $|\mathbf{w}|$, determines prediction confidence; margins connect logistic regression to SVMs and generalization; and linear boundaries fail on patterns like XOR unless nonlinear features are engineered.
What's Next:
Having understood the decision boundary, we conclude this module with the probabilistic interpretation page—examining how logistic regression produces calibrated probabilities, what calibration means, and why probabilistic outputs are often more valuable than hard classifications.
You now have a thorough geometric understanding of logistic regression's decision boundary—how it's defined, visualized, and related to prediction confidence. This perspective is invaluable for model interpretation, debugging, and knowing when linear models are insufficient.