In the previous page, we established the geometric margin—the perpendicular distance from a point to the decision hyperplane. While geometrically intuitive, the geometric margin involves a division by $|\mathbf{w}|$, which complicates optimization.
Enter the functional margin: a simpler, unnormalized quantity that captures the essence of margin while being more amenable to mathematical manipulation. Understanding the interplay between functional and geometric margins is essential for deriving the SVM optimization problem and appreciating why the formulation takes its elegant final form.
By the end of this page, you will understand: (1) The definition and intuition behind functional margin, (2) How functional margin relates to geometric margin, (3) The canonical scaling convention that eliminates scaling ambiguity, (4) Why we need both concepts for SVM formulation, and (5) How the functional margin leads naturally to the maximum margin optimization problem.
The functional margin is simply the raw, unnormalized classification score multiplied by the true label.
Definition (Functional Margin for a Single Point):
For a data point $(\mathbf{x}_i, y_i)$ where $y_i \in \{-1, +1\}$, and a hyperplane defined by $(\mathbf{w}, b)$, the functional margin is:
$$\hat{\gamma}_i = y_i(\mathbf{w}^T\mathbf{x}_i + b)$$
Definition (Dataset Functional Margin):
For a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, the functional margin is:
$$\hat{\gamma} = \min_{i=1,...,n} \hat{\gamma}_i = \min_{i=1,...,n} y_i(\mathbf{w}^T\mathbf{x}_i + b)$$
The functional margin measures how "strongly" a point is classified correctly. A larger positive value indicates stronger classification confidence in the unnormalized sense. Unlike geometric margin (which is in distance units), functional margin is in arbitrary units that depend on the scale of w.
Breaking down the formula:
Let's understand each component:
$\mathbf{w}^T\mathbf{x}_i + b$: The raw linear classifier output
$y_i$: The true label, either +1 or -1
Product $y_i(\mathbf{w}^T\mathbf{x}_i + b)$:
| True Label $y_i$ | Classifier Output | Product $\hat{\gamma}_i$ | Interpretation |
|---|---|---|---|
| +1 | Positive (+) | Positive ✓ | Correct classification |
| +1 | Negative (-) | Negative ✗ | Misclassification |
| -1 | Positive (+) | Negative ✗ | Misclassification |
| -1 | Negative (-) | Positive ✓ | Correct classification |
Example calculation:
Consider a hyperplane with $\mathbf{w} = [2, 3]^T$ and $b = -1$.
For point $\mathbf{x}_1 = [1, 1]^T$ with label $y_1 = +1$: $$\hat{\gamma}_1 = (+1)(2 \cdot 1 + 3 \cdot 1 - 1) = (+1)(4) = 4$$
The functional margin is 4, indicating correct classification with some confidence.
For point $\mathbf{x}_2 = [0, 0]^T$ with label $y_2 = +1$: $$\hat{\gamma}_2 = (+1)(2 \cdot 0 + 3 \cdot 0 - 1) = (+1)(-1) = -1$$
The functional margin is -1, indicating misclassification.
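These two calculations are easy to reproduce in NumPy. Here is a minimal sketch (the `functional_margin` helper is our own illustrative name, not a library function):

```python
import numpy as np

def functional_margin(w: np.ndarray, b: float, x: np.ndarray, y: int) -> float:
    """Functional margin: y * (w^T x + b)."""
    return y * (np.dot(w, x) + b)

w, b = np.array([2.0, 3.0]), -1.0
print(functional_margin(w, b, np.array([1.0, 1.0]), +1))  #  4.0 -> correctly classified
print(functional_margin(w, b, np.array([0.0, 0.0]), +1))  # -1.0 -> misclassified
```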
The connection between functional and geometric margins is straightforward but critically important.
The fundamental relationship:
$$\gamma_i = \frac{\hat{\gamma}_i}{|\mathbf{w}|}$$
Or equivalently:
$$\hat{\gamma}_i = |\mathbf{w}| \cdot \gamma_i$$
Geometric margin (true distance): $$\gamma_i = y_i \cdot \frac{\mathbf{w}^T\mathbf{x}_i + b}{|\mathbf{w}|}$$
Functional margin (scaled score): $$\hat{\gamma}_i = y_i(\mathbf{w}^T\mathbf{x}_i + b) = |\mathbf{w}| \cdot \gamma_i$$
Functional margin = Geometric margin × ||w||. This means the functional margin is a "scaled" version of the geometric margin. The scaling factor is the norm of the weight vector. This relationship is crucial because it lets us convert between the two and exploit the advantages of each.
Why do we need both?
Each margin concept serves a different purpose:
| Aspect | Geometric Margin | Functional Margin |
|---|---|---|
| Interpretation | True Euclidean distance | Scaled confidence score |
| Units | Same as feature space | Arbitrary (depends on $|\mathbf{w}|$) |
| Scale invariance | Yes (invariant to rescaling $\mathbf{w}$) | No (changes with $\mathbf{w}$ scale) |
| Optimization | Division by $|\mathbf{w}|$ complicates | Simpler algebraic form |
| Generalization | Directly appears in bounds | Related via normalization |
The scaling problem:
A significant issue with functional margin is its scale ambiguity. If we replace $(\mathbf{w}, b)$ with $(c\mathbf{w}, cb)$ for any $c > 0$, the functional margin becomes:
$$\hat{\gamma}'_i = y_i((c\mathbf{w})^T\mathbf{x}_i + cb) = c \cdot y_i(\mathbf{w}^T\mathbf{x}_i + b) = c \cdot \hat{\gamma}_i$$
The functional margin scales linearly with $c$! This means we can artificially inflate the functional margin by simply scaling $\mathbf{w}$ and $b$, without actually improving the classifier.
In contrast, the geometric margin is unchanged: $$\gamma'_i = \frac{c \cdot \hat{\gamma}_i}{|c\mathbf{w}|} = \frac{c \cdot \hat{\gamma}_i}{|c| \cdot |\mathbf{w}|} = \frac{\hat{\gamma}_i}{|\mathbf{w}|} = \gamma_i$$
This scale invariance makes geometric margin the "true" measure of separation quality.
```python
import numpy as np


def compare_margins(
    x: np.ndarray,
    y: int,
    w: np.ndarray,
    b: float,
    scale_factors: list = [1, 2, 5, 10]
) -> None:
    """
    Demonstrate the different behaviors of functional and geometric
    margins under rescaling of the hyperplane parameters.

    Parameters:
    -----------
    x : np.ndarray
        A single data point
    y : int
        True label (-1 or +1)
    w : np.ndarray
        Weight vector
    b : float
        Bias term
    scale_factors : list
        Different scaling factors to apply
    """
    print("Margin Behavior Under Rescaling")
    print("=" * 70)
    print(f"Point: {x}")
    print(f"True label: {y}")
    print(f"Original w: {w}, b: {b}")
    print()
    print(f"{'Scale c':<10} {'||cw||':<12} {'Functional':<15} {'Geometric':<15}")
    print("-" * 55)

    for c in scale_factors:
        w_scaled = c * w
        b_scaled = c * b
        w_norm = np.linalg.norm(w_scaled)

        # Functional margin
        functional_margin = y * (np.dot(w_scaled, x) + b_scaled)

        # Geometric margin
        geometric_margin = functional_margin / w_norm

        print(f"{c:<10} {w_norm:<12.4f} {functional_margin:<15.4f} {geometric_margin:<15.4f}")

    print()
    print("Observation: Functional margin scales with c, geometric margin stays constant!")


def margin_relationship_demo():
    """
    Demonstrate that geometric_margin = functional_margin / ||w||
    """
    # Example hyperplane
    w = np.array([3.0, 4.0])  # ||w|| = 5
    b = -2.0

    # Example points
    points = [
        (np.array([2.0, 1.0]), +1, "Inside positive region"),
        (np.array([0.0, 0.0]), +1, "Near boundary, positive"),
        (np.array([-1.0, 0.0]), -1, "Inside negative region"),
        (np.array([0.5, 0.5]), +1, "Close to boundary"),
    ]

    w_norm = np.linalg.norm(w)

    print("\nMargin Relationship Verification")
    print("=" * 80)
    print(f"Hyperplane: {w[0]}x + {w[1]}y + {b} = 0")
    print(f"||w|| = {w_norm}")
    print()
    print(f"{'Description':<25} {'Functional':<12} {'÷ ||w||':<12} {'= Geometric':<12}")
    print("-" * 65)

    for x, y_true, desc in points:
        functional = y_true * (np.dot(w, x) + b)
        geometric = functional / w_norm
        print(f"{desc:<25} {functional:<12.4f} {'/ ' + str(w_norm):<12} {geometric:<12.4f}")

    print()
    print("Verified: γ_geometric = γ_functional / ||w||")


if __name__ == "__main__":
    # Demo 1: Scale invariance of geometric margin
    x = np.array([1.0, 1.0])
    y = +1
    w = np.array([1.0, 2.0])
    b = -1.0

    compare_margins(x, y, w, b)

    # Demo 2: Relationship between margins
    margin_relationship_demo()
```

The scale ambiguity of functional margins poses a challenge: if we can arbitrarily inflate functional margins without improving the classifier, how do we formulate a meaningful optimization problem?
The solution is elegant: we impose a canonical scaling constraint.
The Canonical Constraint:
We require that the minimum functional margin equals 1:
$$\min_{i=1,...,n} y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$$
Equivalently, for all training points: $$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i = 1, ..., n$$
with equality for at least one point.
This constraint fixes the scale of (w, b) by pinning the minimum functional margin to 1. For any separating hyperplane, we can always rescale to achieve this. The constraint removes the scaling degree of freedom, making the optimization well-defined.
Proof that this is achievable:
Suppose we have a separating hyperplane $(\mathbf{w}, b)$ with positive minimum functional margin $\hat{\gamma} = \min_i y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0$.
We can rescale by $c = 1/\hat{\gamma}$, setting $\mathbf{w}' = c\mathbf{w} = \mathbf{w}/\hat{\gamma}$ and $b' = cb = b/\hat{\gamma}$.
Now the minimum functional margin becomes: $$\min_i y_i(\mathbf{w}'^T\mathbf{x}_i + b') = \min_i \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\hat{\gamma}} = \frac{\hat{\gamma}}{\hat{\gamma}} = 1$$
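This rescaling step is easy to verify numerically. Here is a minimal sketch (the `canonicalize` helper and the toy data are our own, chosen for illustration) that divides $(\mathbf{w}, b)$ by the minimum functional margin and checks that the geometric margin is unchanged:

```python
import numpy as np

def canonicalize(w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray):
    """Rescale (w, b) so the minimum functional margin over the data equals 1."""
    gamma_hat = np.min(y * (X @ w + b))   # minimum functional margin (must be > 0)
    assert gamma_hat > 0, "(w, b) must separate the data"
    return w / gamma_hat, b / gamma_hat

# Toy separable data and an arbitrary separating hyperplane
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

w_c, b_c = canonicalize(w, b, X, y)
print(np.min(y * (X @ w_c + b_c)))                  # 1.0 after rescaling
print(np.min(y * (X @ w + b)) / np.linalg.norm(w))  # geometric margin before rescaling
print(1.0 / np.linalg.norm(w_c))                    # = 1/||w'|| afterwards (identical)
```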
Connecting back to geometric margin:
With the canonical scaling $\hat{\gamma} = 1$, the geometric margin becomes: $$\gamma = \frac{\hat{\gamma}}{|\mathbf{w}|} = \frac{1}{|\mathbf{w}|}$$
This is a crucial result! Under canonical scaling:
$$\boxed{\gamma = \frac{1}{|\mathbf{w}|}}$$
With canonical scaling, maximizing geometric margin γ = 1/||w|| is equivalent to minimizing ||w||. Even better, minimizing ||w||² leads to the same optimal hyperplane but yields a quadratic optimization problem—which is much easier to solve than the original non-convex problem!
Geometric interpretation of canonical scaling:
The constraint $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ defines two "margin hyperplanes": $\mathbf{w}^T\mathbf{x} + b = +1$ on the positive side and $\mathbf{w}^T\mathbf{x} + b = -1$ on the negative side.
These are parallel to the decision boundary $\mathbf{w}^T\mathbf{x} + b = 0$ and are equidistant from it.
The constraints ensure that every positive point satisfies $\mathbf{w}^T\mathbf{x}_i + b \geq +1$ and every negative point satisfies $\mathbf{w}^T\mathbf{x}_i + b \leq -1$, so no training point lies strictly between the two margin hyperplanes.
The "margin corridor" or "street" is the region between these hyperplanes: $$-1 \leq \mathbf{w}^T\mathbf{x} + b \leq +1$$
The width of this corridor is: $$\text{width} = \frac{2}{|\mathbf{w}|}$$
Maximizing the margin width = maximizing $1/|\mathbf{w}|$ = minimizing $|\mathbf{w}|$.
| Concept | Expression | Interpretation |
|---|---|---|
| Canonical constraint | $\min_i \hat{\gamma}_i = 1$ | Fixes scale of parameters |
| Margin hyperplanes | $\mathbf{w}^T\mathbf{x} + b = \pm 1$ | Boundaries of margin corridor |
| Corridor width | $2/|\mathbf{w}|$ | Total separation between classes |
| Geometric margin | $1/|\mathbf{w}|$ | Half-width = distance to boundary |
| Optimization target | Minimize $|\mathbf{w}|^2$ | Maximize margin (equivalent) |
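As a quick sanity check on the table (a toy calculation with made-up numbers): if the canonically scaled weight vector were $\mathbf{w} = [3, 4]^T$, then $|\mathbf{w}| = 5$, the geometric margin is $1/5 = 0.2$, and the corridor width is $2/5 = 0.4$:

```python
import numpy as np

w = np.array([3.0, 4.0])      # assume (w, b) already satisfies the canonical constraint
w_norm = np.linalg.norm(w)    # ||w|| = 5
print("geometric margin:", 1 / w_norm)   # 0.2
print("corridor width:  ", 2 / w_norm)   # 0.4
```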
Now we can see how the two margin concepts lead naturally to the SVM optimization problem.
The maximum margin objective (intuitive form):
We want to find the hyperplane that maximizes the geometric margin:
$$\max_{\mathbf{w}, b} \gamma \quad \text{subject to} \quad y_i \cdot \frac{\mathbf{w}^T\mathbf{x}_i + b}{|\mathbf{w}|} \geq \gamma \quad \forall i$$
This says: maximize $\gamma$ while ensuring all points are at least distance $\gamma$ from the boundary.
Conversion to canonical form:
Using the relationship $\gamma = \hat{\gamma}/|\mathbf{w}|$ and imposing canonical scaling $\hat{\gamma} = 1$:
$$\max_{\mathbf{w}, b} \frac{1}{|\mathbf{w}|} \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$
Final optimization form:
Maximizing $1/|\mathbf{w}|$ is equivalent to minimizing $|\mathbf{w}|$, which is equivalent to minimizing $\frac{1}{2}|\mathbf{w}|^2$ (the factor of 1/2 simplifies derivatives):
$$\min_{\mathbf{w}, b} \frac{1}{2}|\mathbf{w}|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$
This is the hard-margin SVM optimization problem in its primal form. It's a convex quadratic program (QP): quadratic objective, linear constraints. Convexity guarantees a unique global optimum—the maximum margin hyperplane.
Understanding the formulation:
Objective: $\frac{1}{2}|\mathbf{w}|^2$. Minimizing the squared norm is equivalent to maximizing the geometric margin $1/|\mathbf{w}|$; squaring (and the factor of $\frac{1}{2}$) keeps the objective smooth and convex with a simple gradient.
Constraints: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$. Every training point must have functional margin at least 1, i.e., lie on or outside its class's margin hyperplane.
Why this works: the constraints pin the worst-case functional margin at (at least) 1, so the worst-case geometric margin is $1/|\mathbf{w}|$; making $|\mathbf{w}|$ as small as the constraints allow therefore makes the margin corridor as wide as possible.
```python
import numpy as np
from scipy.optimize import minimize
from typing import Tuple


def hard_margin_svm_primal(
    X: np.ndarray,
    y: np.ndarray
) -> Tuple[np.ndarray, float, float]:
    """
    Solve the hard-margin SVM primal problem directly.

    min_{w,b} (1/2) ||w||^2
    s.t. y_i(w^T x_i + b) >= 1 for all i

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
    y : np.ndarray
        Labels in {-1, +1}

    Returns:
    --------
    Tuple[np.ndarray, float, float]
        (optimal w, optimal b, optimal margin)
    """
    n_samples, n_features = X.shape

    # Decision variables: [w_1, w_2, ..., w_d, b]
    n_vars = n_features + 1

    def objective(params):
        w = params[:n_features]
        return 0.5 * np.dot(w, w)

    def objective_grad(params):
        grad = np.zeros(n_vars)
        grad[:n_features] = params[:n_features]  # Gradient w.r.t. w
        return grad

    # Inequality constraints: y_i(w^T x_i + b) >= 1.
    # scipy's 'ineq' constraints require fun(params) >= 0,
    # so we negate 1 - y_i(w^T x_i + b) below.
    constraints = []
    for i in range(n_samples):
        def constraint(params, i=i):
            w = params[:n_features]
            b = params[n_features]
            return 1 - y[i] * (np.dot(w, X[i]) + b)

        constraints.append({
            'type': 'ineq',
            'fun': lambda params, i=i: -constraint(params, i)  # y_i(w^T x_i + b) - 1 >= 0
        })

    # Initial guess
    x0 = np.zeros(n_vars)

    # Solve using SLSQP
    result = minimize(
        objective,
        x0,
        method='SLSQP',
        jac=objective_grad,
        constraints=constraints,
        options={'ftol': 1e-10, 'maxiter': 1000}
    )

    w_opt = result.x[:n_features]
    b_opt = result.x[n_features]
    w_norm = np.linalg.norm(w_opt)
    margin = 1.0 / w_norm if w_norm > 0 else float('inf')

    return w_opt, b_opt, margin


def verify_constraints_and_margins(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float
) -> dict:
    """
    Verify that the SVM solution satisfies all constraints
    and compute margin statistics.
    """
    functional_margins = y * (X @ w + b)
    w_norm = np.linalg.norm(w)
    geometric_margins = functional_margins / w_norm

    min_functional = np.min(functional_margins)
    min_geometric = np.min(geometric_margins)
    all_satisfied = np.all(functional_margins >= 1 - 1e-6)

    # Identify support vectors (points with functional margin close to 1)
    support_mask = np.abs(functional_margins - 1.0) < 0.01

    return {
        'functional_margins': functional_margins,
        'geometric_margins': geometric_margins,
        'min_functional': min_functional,
        'min_geometric': min_geometric,
        'predicted_margin': 1.0 / w_norm,
        'all_constraints_satisfied': all_satisfied,
        'support_vector_indices': np.where(support_mask)[0],
        'n_support_vectors': np.sum(support_mask),
    }


# Example demonstration
if __name__ == "__main__":
    # Create linearly separable data
    np.random.seed(42)

    # Positive class: centered at (2, 2)
    X_pos = np.random.randn(10, 2) * 0.5 + np.array([2, 2])
    y_pos = np.ones(10)

    # Negative class: centered at (-2, -2)
    X_neg = np.random.randn(10, 2) * 0.5 + np.array([-2, -2])
    y_neg = -np.ones(10)

    X = np.vstack([X_pos, X_neg])
    y = np.hstack([y_pos, y_neg])

    # Solve hard-margin SVM
    w_opt, b_opt, margin = hard_margin_svm_primal(X, y)

    print("Hard-Margin SVM Solution")
    print("=" * 50)
    print(f"Optimal w: {w_opt}")
    print(f"Optimal b: {b_opt}")
    print(f"||w||: {np.linalg.norm(w_opt):.6f}")
    print(f"Geometric margin (1/||w||): {margin:.6f}")
    print()

    # Verify solution
    verification = verify_constraints_and_margins(X, y, w_opt, b_opt)

    print("Constraint Verification")
    print("-" * 50)
    print(f"All constraints satisfied: {verification['all_constraints_satisfied']}")
    print(f"Min functional margin: {verification['min_functional']:.6f} (should be ~1)")
    print(f"Min geometric margin: {verification['min_geometric']:.6f}")
    print(f"Predicted margin (1/||w||): {verification['predicted_margin']:.6f}")
    print(f"Number of support vectors: {verification['n_support_vectors']}")
    print(f"Support vector indices: {verification['support_vector_indices']}")
```

At this point, you might wonder: if the geometric margin is the "true" margin, why bother with the functional margin at all? The answer is mathematical convenience: the functional margin is what makes the SVM optimization tractable.
Functional margin provides algebraic convenience for optimization; geometric margin provides the meaningful quantity. The canonical scaling constraint bridges them: when functional margin is fixed at 1, geometric margin becomes 1/||w||, and the optimization simplifies to minimizing ||w||². Best of both worlds!
The optimization formulation journey:
Start with geometric margin objective: $$\max_{\mathbf{w},b} \min_i y_i \frac{\mathbf{w}^T\mathbf{x}_i + b}{|\mathbf{w}|}$$
Introduce functional margin: using $\gamma = \hat{\gamma}/|\mathbf{w}|$, rewrite the problem as $\max_{\mathbf{w},b} \hat{\gamma}/|\mathbf{w}|$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq \hat{\gamma}$ for all $i$.
Apply canonical scaling: fix $\hat{\gamma} = 1$ (always possible by rescaling $(\mathbf{w}, b)$), so the objective becomes $\max_{\mathbf{w},b} 1/|\mathbf{w}|$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ for all $i$.
Reformulate as minimization: $$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$$
Summary: The functional margin is a computational convenience that, when combined with canonical scaling, transforms a geometrically motivated but hard problem into an elegant convex program.
While we'll cover the dual formulation in detail later, it's worth previewing how functional margins appear there, as this connection is fundamental to understanding SVMs.
The Lagrangian approach:
Starting from the primal: $$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$
We introduce Lagrange multipliers $\alpha_i \geq 0$ for each constraint:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}|\mathbf{w}|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right]$$
Notice how the functional margin $y_i(\mathbf{w}^T\mathbf{x}_i + b)$ appears directly!
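Setting the derivatives of $\mathcal{L}$ with respect to $\mathbf{w}$ and $b$ to zero (the standard stationarity step, only previewed here; the full derivation comes when we treat the dual in detail) gives:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i = \mathbf{0} \quad\Rightarrow\quad \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i$$

$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0 \quad\Rightarrow\quad \sum_{i=1}^n \alpha_i y_i = 0$$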
The KKT conditions reveal:
For the optimal solution, the KKT complementary slackness condition states:
$$\alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right] = 0 \quad \forall i$$
This means that for every training point, either $\alpha_i = 0$, or the constraint is active and $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$, i.e., the point sits exactly on a margin hyperplane.
The points with αᵢ > 0 are the support vectors. They lie exactly on the margin hyperplanes (functional margin = 1, geometric margin = 1/||w||). Non-support vectors have αᵢ = 0 and don't influence the solution. This is why SVMs are "sparse" and efficient.
Connection to geometric margin:
For a support vector $\mathbf{x}_i$: the functional margin is exactly 1, so its geometric margin (its distance to the decision boundary) is exactly $1/|\mathbf{w}|$.
All support vectors are equidistant from the decision boundary—they define the "width" of the margin corridor. This is why the margin can be computed from any support vector.
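To see this numerically, one option is scikit-learn's `SVC` with a linear kernel and a very large `C`, a common way to approximate the hard-margin SVM (a sketch under that assumption; the toy data below are made up):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs (toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors should have functional margin ~ 1 ...
sv_functional = y[clf.support_] * (clf.support_vectors_ @ w + b)
print(sv_functional)                 # each entry close to 1
# ... and therefore all sit at the same distance 1/||w|| from the boundary
print(1 / np.linalg.norm(w))

# w is built from support vectors alone: w = sum_i alpha_i y_i x_i
print(np.allclose(w, clf.dual_coef_[0] @ clf.support_vectors_))
```

Because `C` is finite, the support-vector margins are only approximately 1, but for cleanly separable data like this they should match very closely.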
The dual problem (preview):
Eliminating $\mathbf{w}$ and $b$ from the Lagrangian leads to the dual:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$ $$\text{s.t.} \quad \alpha_i \geq 0, \quad \sum_{i=1}^n \alpha_i y_i = 0$$
Remarkably, data points appear only in inner products $\mathbf{x}_i^T\mathbf{x}_j$. This observation leads to the kernel trick, enabling nonlinear SVMs. But the story begins here, with the functional margin constraints that make the primal formulation tractable.
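The "inner products only" observation is easy to see in code. Below is a minimal sketch of the dual objective (`dual_objective` is our own illustrative helper, shown only to highlight that the data enter solely through the Gram matrix $K = \mathbf{X}\mathbf{X}^T$):

```python
import numpy as np

def dual_objective(alpha: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Hard-margin SVM dual objective; X appears only through the Gram matrix K."""
    K = X @ X.T                      # K[i, j] = x_i^T x_j
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

# Replacing K with any kernel matrix K[i, j] = k(x_i, x_j) yields a nonlinear SVM:
# that substitution is exactly the kernel trick previewed above.
```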
| Stage | Role of Functional Margin | Key Insight |
|---|---|---|
| Problem formulation | Defines canonical scaling | Removes scale ambiguity |
| Primal constraints | Linear inequalities $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ | Enables convex QP |
| Lagrangian | Appears in constraint terms | Couples $\alpha_i$ to margin violations |
| KKT conditions | Support vectors have margin = 1 | Defines sparse solution |
| Dual formulation | Implicitly encodes margin structure | Enables kernel trick |
Even experienced practitioners sometimes confuse the two margin concepts or misunderstand their roles. Let's clarify common misconceptions.
When comparing SVMs or reporting margins, always use geometric margin (normalized by ||w||). Two SVMs with different ||w|| can have the same geometric margin but wildly different functional margins. Raw decision function outputs are not comparable across models!
We've established the functional margin as the computational companion to the geometric margin. Let's consolidate the key insights:
Functional margin: $\hat{\gamma}_i = y_i(\mathbf{w}^T\mathbf{x}_i + b)$ is positive exactly when a point is correctly classified, but its magnitude depends on the scale of $(\mathbf{w}, b)$.
Geometric margin: $\gamma_i = \hat{\gamma}_i/|\mathbf{w}|$ is the true, scale-invariant distance to the decision boundary.
Canonical scaling: fixing the minimum functional margin at 1 removes the scale ambiguity and gives $\gamma = 1/|\mathbf{w}|$.
Optimization: maximizing the margin becomes $\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$, a convex quadratic program.
What's next:
With both margin concepts established, we're ready to formalize the maximum margin objective. In the next page, we'll see how the goal of maximizing geometric margin leads to an elegant convex optimization problem, and why this particular formulation has unique and powerful properties.
You now understand functional margins and their relationship to geometric margins. The canonical scaling convention transforms the intuitive goal of margin maximization into the precise mathematical objective of the SVM. Next, we'll explore the maximum margin objective in depth.