Among all the training data points in a classification problem, only a handful truly matter for the final SVM classifier. These privileged points, called support vectors, are the sole determinants of the decision boundary. Every other training point could be removed, or moved anywhere outside the margin, without affecting the optimal hyperplane.
This remarkable property is central to what makes SVMs unique: they identify the critical "frontier" points that define the boundary between classes. Understanding support vectors is essential for interpreting SVM results, understanding their computational efficiency, and appreciating their theoretical elegance.
By the end of this page, you will understand: (1) The precise definition of support vectors from KKT conditions, (2) How to identify support vectors geometrically and algebraically, (3) Why support vectors completely determine the optimal hyperplane, (4) The sparsity properties and computational implications, and (5) How support vectors relate to classifier interpretation and robustness.
Support vectors are defined through the Karush-Kuhn-Tucker (KKT) conditions of the SVM optimization problem. Let's develop this definition rigorously.
Recall the primal problem:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$ $$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, ..., n$$
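To make the primal concrete, here is a minimal sketch that solves this quadratic program directly with a generic convex solver. It assumes `cvxpy` is installed and uses a small synthetic, linearly separable dataset; the variable names (`w`, `b`, `problem`) are illustrative rather than part of any prescribed API. The dual values printed at the end preview the Lagrange multipliers introduced next.

```python
import cvxpy as cp
import numpy as np

# Synthetic, linearly separable toy data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.6, (20, 2)), rng.normal(-2.0, 0.6, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(objective, constraints)
problem.solve()

print("w* =", w.value, " b* =", b.value)
print("margin =", 1 / np.linalg.norm(w.value))
# The solver also exposes the dual variables of the margin constraints;
# points with a strictly positive dual value are the support vectors.
print("support vector indices:", np.where(constraints[0].dual_value > 1e-6)[0])
```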
The Lagrangian:
Introducing Lagrange multipliers $\alpha_i \geq 0$ for each constraint:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right]$$
The KKT conditions:
At the optimal solution $(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)$:
Stationarity: $\nabla_{\mathbf{w}} \mathcal{L} = 0 \Rightarrow \mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$
Stationarity (b): $\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_{i=1}^n \alpha_i^* y_i = 0$
Primal feasibility: $y_i(\mathbf{w}^{*T}\mathbf{x}_i + b^*) \geq 1$
Dual feasibility: $\alpha_i^* \geq 0$
Complementary slackness: $\alpha_i^* [y_i(\mathbf{w}^{*T}\mathbf{x}_i + b^*) - 1] = 0$
A training point xᵢ is a support vector if and only if its corresponding Lagrange multiplier is positive:
$$\alpha_i^* > 0$$
Equivalently (by complementary slackness), a support vector lies exactly on the margin:
$$y_i(\mathbf{w}^{*T}\mathbf{x}_i + b^*) = 1$$
Understanding complementary slackness:
The condition $\alpha_i^* [y_i(\mathbf{w}^{*T}\mathbf{x}_i + b^*) - 1] = 0$ implies that for every training point at least one factor must vanish: either $\alpha_i^* = 0$, or the constraint is active, $y_i(\mathbf{w}^{*T}\mathbf{x}_i + b^*) = 1$ (or both).
This creates a clear dichotomy:
| Point Type | Lagrange Multiplier | Functional Margin | Location |
|---|---|---|---|
| Support Vector | $\alpha_i^* > 0$ | Exactly 1 | On margin boundary |
| Non-Support Vector | $\alpha_i^* = 0$ | Greater than 1 | Beyond margin |
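This dichotomy is easy to check numerically. The sketch below, a minimal example assuming scikit-learn and synthetic separable data, uses a very large `C` so that `SVC` approximates the hard-margin problem, then compares the functional margins of the points flagged as support vectors (`clf.support_`) with those of all other points.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic separable toy data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.6, (20, 2)), rng.normal(-2.0, 0.6, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin SVM
clf = SVC(kernel='linear', C=1e10).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Functional margins y_i (w^T x_i + b) for all training points
margins = y * (X @ w + b)

is_sv = np.zeros(len(X), dtype=bool)
is_sv[clf.support_] = True          # points with alpha_i > 0

# Dichotomy: support vectors have margin ~ 1, all others have margin > 1
print("SV margins (should be ~1):       ", np.round(margins[is_sv], 3))
print("min non-SV margin (should be >1):", round(float(margins[~is_sv].min()), 3))
```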
The "support" interpretation:
The term "support vector" comes from the fact that these points "support" or define the decision boundary—they are the structural pillars holding up the margin corridor. Remove them, and the margin would collapse (change).
The geometric picture of support vectors is intuitive and illuminating.
The margin boundaries:
Recall the three parallel hyperplanes:

- The decision boundary: $\mathbf{w}^T\mathbf{x} + b = 0$
- The positive margin boundary: $\mathbf{w}^T\mathbf{x} + b = +1$
- The negative margin boundary: $\mathbf{w}^T\mathbf{x} + b = -1$

Support vector locations: positive-class support vectors lie exactly on the positive margin boundary, and negative-class support vectors lie exactly on the negative margin boundary.

Non-support vector locations: all other training points lie strictly beyond their respective margin hyperplane, i.e. $y_i(\mathbf{w}^T\mathbf{x}_i + b) > 1$.
Imagine the decision boundary as an elastic string stretched between pins. The support vectors are the pins—they're in contact with the string and determine its position. All other points are too far away to touch the string. Moving a pin changes the string's position; moving non-pins has no effect (as long as they don't become closer than the current pins).
Minimum number of support vectors:

For a non-trivial separating hyperplane in $\mathbb{R}^d$, there must be at least one support vector from each class (otherwise the margin could still be widened), so at least two support vectors in total.

However, the actual number depends on the data geometry: several points of a class may tie for the minimum distance and lie on the margin boundary simultaneously.

Geometric characterization:

The positive support vectors and the negative support vectors lie on the boundaries of their respective "sides" of the optimal separation: they are the points of each class closest to the separating hyperplane, and both classes touch the margin at the same distance $\gamma = 1/\|\mathbf{w}\|$.
This is why the maximum margin hyperplane is sometimes called the "equidistant" hyperplane—it maintains equal distance to the nearest points of both classes.
| Property | Support Vectors | Non-Support Vectors |
|---|---|---|
| Distance to boundary | Exactly γ = 1/||w|| | Greater than γ |
| Functional margin | Exactly 1 | Greater than 1 |
| Location | On margin hyperplane | Beyond margin hyperplane |
| Lagrange multiplier α | Positive (α > 0) | Zero (α = 0) |
| Effect on solution | Determines w, b | No effect |
The remarkable sparsity of SVMs comes from the fact that the optimal solution is completely determined by the support vectors.
The weight vector from support vectors:
From the stationarity condition: $$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$$
Since $\alpha_i^* = 0$ for non-support vectors, this simplifies to: $$\mathbf{w}^* = \sum_{i \in \text{SV}} \alpha_i^* y_i \mathbf{x}_i$$
The optimal weight vector is a linear combination of support vectors only!
Computing the bias:
For any support vector $\mathbf{x}_j$ (with $\alpha_j^* > 0$), we have: $$y_j(\mathbf{w}^{*T}\mathbf{x}_j + b^*) = 1$$
Solving for $b^*$: $$b^* = y_j - \mathbf{w}^{*T}\mathbf{x}_j = y_j - \sum_{i \in \text{SV}} \alpha_i^* y_i (\mathbf{x}_i^T\mathbf{x}_j)$$
In practice, we average over all support vectors for numerical stability: $$b^* = \frac{1}{|\text{SV}|} \sum_{j \in \text{SV}} \left( y_j - \sum_{i \in \text{SV}} \alpha_i^* y_i (\mathbf{x}_i^T\mathbf{x}_j) \right)$$
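As a sanity check, this averaged bias can be reproduced from scikit-learn's dual coefficients, which store $\alpha_i y_i$ for the support vectors. The snippet below is a minimal sketch on synthetic separable data, using a very large `C` to approximate the hard-margin fit.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic separable toy data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.6, (20, 2)), rng.normal(-2.0, 0.6, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel='linear', C=1e10).fit(X, y)   # near hard-margin
sv = clf.support_vectors_                      # x_i for i in SV
alpha_y = clf.dual_coef_[0]                    # alpha_i * y_i for i in SV
y_sv = y[clf.support_]

# b* = average over SVs of ( y_j - sum_i alpha_i y_i <x_i, x_j> )
G = sv @ sv.T                                  # Gram matrix of the support vectors
b_avg = np.mean(y_sv - G @ alpha_y)

print("averaged b*:", round(float(b_avg), 6))
print("sklearn  b*:", round(float(clf.intercept_[0]), 6))
```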
The optimal weight vector can be written as a sparse linear combination of training points:
$$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$$
This is a specific instance of the Representer Theorem from kernel methods. The sparsity (α = 0 for non-SVs) is a consequence of the hinge loss structure and the complementary slackness conditions.
Prediction using support vectors:
For a new point $\mathbf{x}_{\text{new}}$, the decision function is: $$f(\mathbf{x}_{\text{new}}) = \mathbf{w}^{*T}\mathbf{x}_{\text{new}} + b^* = \sum_{i \in \text{SV}} \alpha_i^* y_i (\mathbf{x}_i^T\mathbf{x}_{\text{new}}) + b^*$$
The prediction is: $$\hat{y} = \text{sign}(f(\mathbf{x}_{\text{new}}))$$
Key observation: Prediction depends only on inner products between the test point and support vectors. This is the foundation for the kernel trick!
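The sketch below illustrates this: it evaluates the decision function for a hypothetical test point using only inner products with the support vectors and compares the result with scikit-learn's `decision_function`. The synthetic data and the large `C` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic separable toy data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.6, (20, 2)), rng.normal(-2.0, 0.6, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
clf = SVC(kernel='linear', C=1e10).fit(X, y)

x_new = np.array([0.5, -0.3])                  # hypothetical test point

# f(x_new) = sum_{i in SV} alpha_i y_i <x_i, x_new> + b
alpha_y = clf.dual_coef_[0]                    # alpha_i * y_i
f_manual = alpha_y @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]

print("manual decision value :", float(f_manual))
print("sklearn decision value:", float(clf.decision_function([x_new])[0]))
print("prediction:", int(np.sign(f_manual)))
```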
Independence from non-support vectors:
The following operations do not change the optimal hyperplane: removing any non-support vector, moving a non-support vector (as long as it stays outside the margin), or adding new points that lie outside the margin.

The following operations generally do change the hyperplane: removing or moving a support vector, or adding a point that falls inside the current margin.
```python
import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from typing import Tuple, Dict, Any


def analyze_support_vectors(X: np.ndarray, y: np.ndarray) -> Dict[str, Any]:
    """
    Perform comprehensive support vector analysis on the given data.

    Returns detailed information about support vectors, their contribution
    to the solution, and margin properties.
    """
    # Train hard-margin SVM
    svm = SVC(kernel='linear', C=1e10)
    svm.fit(X, y)

    # Extract parameters
    w = svm.coef_[0]
    b = svm.intercept_[0]
    w_norm = np.linalg.norm(w)
    margin = 1 / w_norm

    # Get support vector information
    sv_indices = svm.support_
    sv_alphas = svm.dual_coef_[0]  # alpha_i * y_i
    support_vectors = svm.support_vectors_

    # Compute margins for all points
    functional_margins = y * (X @ w + b)
    geometric_margins = functional_margins / w_norm

    # Identify support vectors by margin (should match sv_indices)
    sv_by_margin = np.where(np.abs(functional_margins - 1.0) < 0.01)[0]

    # Verify w is a linear combination of support vectors
    w_reconstructed = np.zeros_like(w)
    for idx, sv_idx in enumerate(sv_indices):
        # sv_alphas contains alpha_i * y_i
        w_reconstructed += sv_alphas[idx] * X[sv_idx]
    reconstruction_error = np.linalg.norm(w - w_reconstructed)

    return {
        'w': w,
        'b': b,
        'margin': margin,
        'support_vector_indices': sv_indices,
        'support_vectors': support_vectors,
        'n_support_vectors': len(sv_indices),
        'n_total_samples': len(X),
        'sv_fraction': len(sv_indices) / len(X),
        'alphas_times_y': sv_alphas,
        'w_reconstruction_error': reconstruction_error,
        'functional_margins': functional_margins,
        'geometric_margins': geometric_margins,
    }


def demonstrate_sv_determines_solution(X: np.ndarray, y: np.ndarray) -> None:
    """
    Demonstrate that only support vectors determine the solution
    by removing non-support vectors and retraining.
    """
    print("=" * 70)
    print("Demonstration: Support Vectors Determine the Solution")
    print("=" * 70)

    # Full data solution
    analysis = analyze_support_vectors(X, y)
    print(f"\nFull Dataset:")
    print(f"  Total samples: {analysis['n_total_samples']}")
    print(f"  Support vectors: {analysis['n_support_vectors']} ({100*analysis['sv_fraction']:.1f}%)")
    print(f"  w = {analysis['w']}")
    print(f"  b = {analysis['b']:.6f}")
    print(f"  Margin = {analysis['margin']:.6f}")

    # Train on support vectors only
    sv_indices = analysis['support_vector_indices']
    X_sv = X[sv_indices]
    y_sv = y[sv_indices]

    svm_sv = SVC(kernel='linear', C=1e10)
    svm_sv.fit(X_sv, y_sv)
    w_sv = svm_sv.coef_[0]
    b_sv = svm_sv.intercept_[0]
    margin_sv = 1 / np.linalg.norm(w_sv)

    print(f"\nSupport Vectors Only:")
    print(f"  Samples used: {len(X_sv)}")
    print(f"  w = {w_sv}")
    print(f"  b = {b_sv:.6f}")
    print(f"  Margin = {margin_sv:.6f}")

    # Compare
    w_diff = np.linalg.norm(analysis['w'] - w_sv)
    b_diff = np.abs(analysis['b'] - b_sv)
    print(f"\nDifference (should be ~0):")
    print(f"  ||w_full - w_sv|| = {w_diff:.10f}")
    print(f"  |b_full - b_sv| = {b_diff:.10f}")

    # Verify w reconstruction from support vectors
    print(f"\nWeight Vector Reconstruction:")
    print(f"  ||w - Σαᵢyᵢxᵢ|| = {analysis['w_reconstruction_error']:.10f}")
    print(f"  (Should be ~0 if w is truly determined by SVs)")


def visualize_support_vectors(X: np.ndarray, y: np.ndarray,
                              figsize: Tuple[int, int] = (14, 5)) -> None:
    """
    Visualize support vectors and their role in determining the hyperplane.
    """
    analysis = analyze_support_vectors(X, y)
    fig, axes = plt.subplots(1, 3, figsize=figsize)

    w = analysis['w']
    b = analysis['b']
    margin = analysis['margin']
    sv_indices = analysis['support_vector_indices']

    # Common plot elements
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))

    def plot_decision_boundary(ax, w, b, title, show_sv=True):
        Z = w[0] * xx + w[1] * yy + b
        ax.contourf(xx, yy, Z, levels=[-1, 1], colors=['lightgreen'], alpha=0.3)
        ax.contour(xx, yy, Z, levels=[-1, 0, 1],
                   colors=['red', 'black', 'blue'],
                   linestyles=['--', '-', '--'], linewidths=[1.5, 2.5, 1.5])
        ax.scatter(X[y == 1, 0], X[y == 1, 1], c='blue', s=60,
                   marker='o', edgecolors='black', label='Positive')
        ax.scatter(X[y == -1, 0], X[y == -1, 1], c='red', s=60,
                   marker='s', edgecolors='black', label='Negative')
        if show_sv:
            ax.scatter(X[sv_indices, 0], X[sv_indices, 1],
                       facecolors='none', edgecolors='gold', s=200,
                       linewidths=3, label='Support Vectors')
        ax.set_xlabel('$x_1$')
        ax.set_ylabel('$x_2$')
        ax.set_title(title)
        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)
        ax.legend(loc='best', fontsize=8)
        ax.grid(True, alpha=0.3)

    # Plot 1: Full solution with SVs highlighted
    plot_decision_boundary(axes[0], w, b, 'Full Data with Support Vectors')

    # Plot 2: Solution trained on SVs only
    svm_sv = SVC(kernel='linear', C=1e10)
    svm_sv.fit(X[sv_indices], y[sv_indices])
    plot_decision_boundary(axes[1], svm_sv.coef_[0], svm_sv.intercept_[0],
                           'Trained on SVs Only', show_sv=True)

    # Plot 3: Non-SVs removed visualization
    ax3 = axes[2]
    ax3.scatter(X[sv_indices][y[sv_indices] == 1, 0],
                X[sv_indices][y[sv_indices] == 1, 1],
                c='blue', s=100, marker='o', edgecolors='black')
    ax3.scatter(X[sv_indices][y[sv_indices] == -1, 0],
                X[sv_indices][y[sv_indices] == -1, 1],
                c='red', s=100, marker='s', edgecolors='black')
    Z = w[0] * xx + w[1] * yy + b
    ax3.contour(xx, yy, Z, levels=[-1, 0, 1],
                colors=['red', 'black', 'blue'],
                linestyles=['--', '-', '--'], linewidths=[1.5, 2.5, 1.5])
    ax3.set_xlabel('$x_1$')
    ax3.set_ylabel('$x_2$')
    ax3.set_title(f'Only {len(sv_indices)} SVs Needed')
    ax3.set_xlim(x_min, x_max)
    ax3.set_ylim(y_min, y_max)
    ax3.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('support_vectors_analysis.png', dpi=150, bbox_inches='tight')
    plt.show()


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create data with clear separation
    X_pos = np.random.randn(30, 2) * 0.7 + np.array([2.5, 2.5])
    X_neg = np.random.randn(30, 2) * 0.7 + np.array([-2.5, -2.5])
    X = np.vstack([X_pos, X_neg])
    y = np.array([1]*30 + [-1]*30)

    demonstrate_sv_determines_solution(X, y)
    visualize_support_vectors(X, y)
```

The support vector sparsity has profound implications for computational efficiency and model storage.
The sparsity property:
In typical applications, only a small fraction of training points are support vectors, often well under 10% when the classes are well separated (see the table of typical SV fractions below).
Prediction complexity comparison:
| Method | Prediction Cost | Storage |
|---|---|---|
| Store all training data | O(n·d) per prediction | O(n·d) |
| SVM (store w, b) | O(d) per prediction | O(d) |
| SVM (store SVs, predict via inner products) | O(\|SV\|·d) per prediction | O(\|SV\|·d) |
For linear SVM, we typically precompute $\mathbf{w}$ and predict in O(d) time. For kernel SVM, we store support vectors and predict in O(|SV|·d) time—still a significant saving when |SV| << n.
Consider training on 1 million samples with 1000 features. If only 1% are support vectors:
• Full data storage: 1M × 1000 = 1B floats (~4GB)
• SV storage: 10K × 1000 = 10M floats (~40MB)
• 100x compression, 100x faster kernel predictions
This sparsity enables SVMs to scale to large datasets while maintaining fast inference.
Factors affecting the number of support vectors include the degree of class separation, the amount of overlap or noise near the boundary, the dimensionality of the data, and, for soft-margin SVMs, the regularization parameter C.
Training vs. Prediction efficiency:
While predicting with SVMs is efficient, training can be expensive: solving the dual quadratic program scales roughly between O(n²) and O(n³) in the number of training samples.
However, once trained, the SVM produces a compact model defined only by support vectors—excellent for deployment.
Memory-efficient SVMs:
For very large datasets, several strategies exploit SV sparsity during training: decomposition and working-set methods such as SMO optimize a few multipliers at a time, shrinking heuristics temporarily drop points that are unlikely to end up as support vectors, and chunking keeps only candidate support vectors in memory.
| Problem Type | Typical SV Fraction | Reason |
|---|---|---|
| Well-separated classes | 1-5% | Large margin, few frontier points |
| Close but separable classes | 5-15% | Narrow margin, more frontier points |
| Overlapping classes (soft margin) | 15-40% | Many points violate or touch margin |
| High-dimensional text | 10-40% | Many informative features |
| Image classification | 10-30% | Complex visual patterns |
Support vectors provide unique interpretability advantages that most other classifiers lack.
What support vectors tell us:
Boundary examples: SVs are the "prototypical boundary cases"—examples that are hardest to classify or most informative for the decision.
Class confusion regions: Examining SVs reveals where classes are most similar or confused.
Data quality issues: Outliers often become SVs—spotting unusual SVs can reveal labeling errors or data anomalies.
Feature importance (indirect): Features that vary most among SVs are often most discriminative.
Diagnostic use cases: inspecting the support vectors can surface mislabeled examples, outliers, and regions where the classes overlap.
In hard-margin SVM, a single outlier on the wrong side makes the problem infeasible. Outliers near the boundary often become support vectors with high influence. Always inspect support vectors for anomalous training examples—they might be labeling errors!
Interpreting α values:
The Lagrange multiplier $\alpha_i$ indicates how "important" a support vector is: a larger $\alpha_i^*$ means the point contributes more weight to $\mathbf{w}^*$ and exerts more influence on the position of the boundary.
Points with $\alpha_i$ close to the upper bound (in soft-margin) are often on the wrong side of the margin—these are the "difficult" examples.
Example: Diagnosing an SVM model:
Model Statistics:
- 1000 training samples
- 47 support vectors (4.7%)
- 25 from class +1, 22 from class -1
- Max α: 12.3, Mean α: 1.8
Interpretation:
✓ Low SV fraction → well-separated classes
✓ Balanced SVs → symmetric complexity
? High max α → check if that point is an outlier
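A report like the one above can be generated from a fitted scikit-learn model. The sketch below is illustrative only: the dataset is synthetic, the statistics will differ from the numbers shown above, and since `dual_coef_` stores $\alpha_i y_i$, absolute values are taken to recover the $\alpha_i$.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 1.0, (500, 2)), rng.normal(-1.5, 1.0, (500, 2))])
y = np.array([1] * 500 + [-1] * 500)

clf = SVC(kernel='linear', C=1.0).fit(X, y)        # soft-margin fit

alphas = np.abs(clf.dual_coef_[0])                 # |alpha_i * y_i| = alpha_i
n_sv_per_class = clf.n_support_                    # counts ordered by classes_ = [-1, +1]

print(f"{len(X)} training samples")
print(f"{len(clf.support_)} support vectors ({100 * len(clf.support_) / len(X):.1f}%)")
print(f"{n_sv_per_class[1]} from class +1, {n_sv_per_class[0]} from class -1")
print(f"Max alpha: {alphas.max():.2f}, Mean alpha: {alphas.mean():.2f}")
print(f"SVs at the upper bound C: {np.isclose(alphas, clf.C).sum()}")
```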
SVs as representative examples:
In applications like image classification or NLP, you can visualize support vectors to understand what the model considers "borderline": for example, the images or documents that lie closest to the decision boundary.
While this module focuses on hard-margin SVM, it's important to preview how support vectors behave in the more practical soft-margin setting.
Soft-margin introduces slack variables:
The soft-margin problem allows constraint violations via slack variables $\xi_i$:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$ $$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Three types of points emerge:
Non-support vectors ($\alpha_i = 0$): correctly classified points strictly beyond the margin, with $\xi_i = 0$; they have no influence on the solution.

Margin support vectors ($0 < \alpha_i < C$): points lying exactly on the margin boundary, with $\xi_i = 0$; they play the same role as hard-margin support vectors.

Bounded support vectors ($\alpha_i = C$): points that violate the margin, with $\xi_i > 0$; they lie inside the margin or on the wrong side of the boundary.
| Type | α Value | ξ Value | Location | Functional Margin |
|---|---|---|---|---|
| Non-SV | 0 | 0 | Beyond margin | > 1 |
| Margin SV | (0, C) | 0 | On margin | = 1 |
| Bounded SV (inside margin) | C | (0, 1) | Inside margin | (0, 1) |
| Bounded SV (misclassified) | C | > 1 | Wrong side | < 0 |
The parameter C controls the trade-off:
Large C → penalize violations heavily → fewer bounded SVs, solution closer to hard margin.

Small C → allow more violations → more bounded SVs, wider margin, potentially more training errors.
In the limit C → ∞, soft-margin SVM approaches hard-margin SVM.
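The following sketch illustrates this trade-off on synthetic overlapping data, assuming scikit-learn: for several values of `C` it splits the support vectors into margin SVs ($0 < \alpha_i < C$) and bounded SVs ($\alpha_i = C$) by comparing the recovered $\alpha_i$ to `C`.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping synthetic classes (illustrative only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(1.0, 1.2, (200, 2)), rng.normal(-1.0, 1.2, (200, 2))])
y = np.array([1] * 200 + [-1] * 200)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_[0])            # alpha_i for the support vectors
    bounded = int(np.isclose(alphas, C).sum())    # alpha_i = C (margin violators)
    margin_sv = len(alphas) - bounded             # 0 < alpha_i < C (on the margin)
    print(f"C={C:>7}: {len(alphas):3d} SVs total, "
          f"{margin_sv:3d} margin SVs, {bounded:3d} bounded SVs")
```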
Why this matters:
In practice, almost all SVM applications use the soft-margin formulation, because real data is rarely perfectly separable and a hard margin is brittle to outliers and label noise.

Understanding the three types of points is the key to interpreting a trained soft-margin SVM: margin SVs define the boundary, bounded SVs flag difficult or noisy examples, and non-SVs are safely classified.
Computing b in soft-margin:
For computing $b^*$, we use margin SVs (not bounded SVs), because only margin SVs satisfy $y_i(\mathbf{w}^{*T}\mathbf{x}_i + b^*) = 1$ exactly; bounded SVs have unknown slack $\xi_i > 0$ and would bias the estimate.
For those seeking deeper mathematical understanding, support vectors have several elegant properties.
Property 1: Support Vector Optimality
A set of points $S \subseteq \{1, \dots, n\}$ could be support vectors for the optimal solution if and only if the SVM trained on just $S$ yields the same margin as the SVM trained on all data.
Property 2: Minimum Support Vector Set
There exists a minimal subset of the support vectors that by itself yields the same optimal hyperplane; for data in general position in $\mathbb{R}^d$, its size is at most $d + 1$.
Property 3: Connection to Convex Hulls
Support vectors lie on the boundaries of the convex hulls of each class, where $C_+$ denotes the convex hull of the positive examples and $C_-$ the convex hull of the negative examples.

The margin $\gamma$ equals half the distance between $C_+$ and $C_-$.
Let p⁺ be the point in the convex hull of positive examples closest to the convex hull of negative examples, and p⁻ be the analogous point for negatives. The optimal SVM hyperplane is perpendicular to (p⁺ - p⁻) and bisects the segment connecting them. The support vectors are exactly the examples whose convex combination gives p⁺ and p⁻.
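For two-dimensional data in general position, this property can be checked directly. The sketch below assumes `scipy` and scikit-learn are available and verifies that every support vector of a (near) hard-margin SVM is a vertex of its class's convex hull; the dataset and threshold are illustrative.

```python
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.svm import SVC

# Synthetic separable 2-D data (illustrative only)
rng = np.random.default_rng(3)
X_pos = rng.normal(2.5, 0.8, (30, 2))
X_neg = rng.normal(-2.5, 0.8, (30, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel='linear', C=1e10).fit(X, y)      # near hard-margin

# Vertices of each class's convex hull, as indices into the full dataset
hull_pos = {int(v) for v in ConvexHull(X_pos).vertices}
hull_neg = {int(v) + len(X_pos) for v in ConvexHull(X_neg).vertices}
hull_vertices = hull_pos | hull_neg

# Every support vector should lie on its class's hull boundary
print("SV indices:", sorted(int(i) for i in clf.support_))
print("all SVs on a hull boundary:",
      {int(i) for i in clf.support_}.issubset(hull_vertices))
```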
Property 4: Stability Characterization
The optimal solution is "stable" with respect to non-support vectors:
Property 5: Sensitivity Analysis
The margin and support vectors are sensitive to perturbations of the support vectors themselves, to label errors near the boundary, and to rescaling of the features.
Property 6: Dimensionality and SVs
In $\mathbb{R}^d$ with data in general position, the number of support vectors typically ranges from 2 up to $d + 1$.
Property 7: SV Influence Functions
The influence of removing support vector $i$ on the margin can be approximated: $$\Delta \gamma \approx \frac{\alpha_i^*}{\|\mathbf{w}^*\|^3}$$
Support vectors with larger $\alpha_i^*$ have more influence on the margin.
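The quality of this approximation can be probed empirically: remove each support vector in turn, retrain, and compare the actual change in margin with $\alpha_i^*/\|\mathbf{w}^*\|^3$. The sketch below assumes scikit-learn and synthetic data; treat it as a rough check of the stated approximation rather than a validation of it.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic separable toy data (illustrative only)
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(2.0, 0.8, (40, 2)), rng.normal(-2.0, 0.8, (40, 2))])
y = np.array([1] * 40 + [-1] * 40)

clf = SVC(kernel='linear', C=1e10).fit(X, y)      # near hard-margin
w_norm = np.linalg.norm(clf.coef_[0])
gamma_full = 1 / w_norm
alphas = np.abs(clf.dual_coef_[0])                # alpha_i for the support vectors

# Remove each support vector in turn and measure the actual margin change
for idx, sv_idx in enumerate(clf.support_):
    mask = np.ones(len(X), dtype=bool)
    mask[sv_idx] = False
    clf_i = SVC(kernel='linear', C=1e10).fit(X[mask], y[mask])
    gamma_i = 1 / np.linalg.norm(clf_i.coef_[0])
    approx = alphas[idx] / w_norm**3              # the approximation stated above
    print(f"SV {sv_idx}: actual Δγ = {gamma_i - gamma_full:+.4f}, "
          f"approx = {approx:.4f}")
```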
Support vectors are the cornerstone of SVM's elegance and efficiency. Let's consolidate the key insights:

- A support vector is a training point with $\alpha_i^* > 0$; equivalently, it lies exactly on the margin boundary.
- The optimal $\mathbf{w}^*$ and $b^*$ are determined entirely by the support vectors; all other points can be removed without changing the solution.
- Support vectors are typically a small fraction of the training set, which keeps stored models compact and kernel predictions fast.
- In the soft-margin setting, support vectors split into margin SVs and bounded SVs, and their counts reveal how difficult the problem is.
What's next:
We've established that support vectors define the maximum margin hyperplane. But is this hyperplane unique? In the final page of this module, we'll prove the uniqueness of the maximum margin solution and understand why the SVM optimization always converges to the same answer.
You now understand support vectors—the critical points that define the SVM solution. Their sparsity makes SVMs efficient; their location on the margin boundary makes them interpretable; and their sole determination of the hyperplane makes SVMs uniquely elegant among classifiers. Next, we'll prove the uniqueness of this solution.