Every constrained optimization problem has a hidden twin—a dual problem that approaches the same answer from a completely different direction. While the original (primal) problem minimizes the objective over the decision variables $\mathbf{x}$, the dual problem maximizes a related function over the Lagrange multipliers $\boldsymbol{\alpha}, \boldsymbol{\beta}$.
This isn't merely a mathematical curiosity. Duality is one of the most profound ideas in optimization, with applications ranging from economics to physics to machine learning.
By the end of this page, you will understand Lagrangian duality: how to construct dual problems, the meaning of weak and strong duality, when the duality gap is zero, and how to interpret dual solutions. This foundation is essential for understanding SVMs and many optimization algorithms.
Recall our constrained optimization problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \leq 0, \; h_j(\mathbf{x}) = 0$$
with Lagrangian:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_i \alpha_i g_i(\mathbf{x}) + \sum_j \beta_j h_j(\mathbf{x})$$
where $\boldsymbol{\alpha} \geq 0$.
The Dual Function:
Define the Lagrangian dual function (or just "dual function") as:
$$d(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
This is obtained by minimizing the Lagrangian over $\mathbf{x}$ while treating $\boldsymbol{\alpha}, \boldsymbol{\beta}$ as fixed parameters.
Key Property: The infimum might be $-\infty$ for some choices of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$. The domain of $d$ where it's finite is called the dual feasible region.
The idea is to 'hide' the complexity of x by computing the best x for each fixed choice of multipliers. What remains is a function purely of the multipliers, which captures the essential structure of the constrained problem through a different lens.
Example: For the primal $\min_x x^2$ subject to $x \geq 1$ (i.e., $-x + 1 \leq 0$), the Lagrangian is
$$\mathcal{L}(x, \alpha) = x^2 + \alpha(-x + 1) = x^2 - \alpha x + \alpha, \quad \alpha \geq 0$$
Minimizing over $x$: $\partial \mathcal{L}/\partial x = 2x - \alpha = 0 \Rightarrow x = \alpha/2$. Substituting back:
$$d(\alpha) = \left(\frac{\alpha}{2}\right)^2 - \alpha \cdot \frac{\alpha}{2} + \alpha = \frac{\alpha^2}{4} - \frac{\alpha^2}{2} + \alpha = -\frac{\alpha^2}{4} + \alpha$$
The dual function is a concave quadratic in $\alpha$.
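To sanity-check this algebra numerically, here is a minimal sketch (assuming only numpy) that approximates $\inf_x \mathcal{L}(x, \alpha)$ on a dense grid and compares it with the closed form $-\alpha^2/4 + \alpha$:

```python
import numpy as np

# Primal: min x^2 subject to x >= 1; Lagrangian L(x, a) = x^2 + a(1 - x)
x_grid = np.linspace(-10, 10, 200001)        # dense grid standing in for inf over x

for alpha in [0.0, 0.5, 1.0, 2.0, 4.0]:
    L = x_grid**2 + alpha * (1 - x_grid)     # Lagrangian along the grid
    d_numeric = L.min()                      # approximate inf_x L(x, alpha)
    d_closed = -alpha**2 / 4 + alpha         # closed-form dual function
    print(f"alpha={alpha:4.1f}  grid min={d_numeric:.6f}  closed form={d_closed:.6f}")
```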
Properties of the Dual Function:
- Always concave: even if the primal problem is non-convex, $d$ is concave (it is an infimum of affine functions of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$)
- Provides lower bounds: for any dual-feasible $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ with $\boldsymbol{\alpha} \geq 0$, we have $d(\boldsymbol{\alpha}, \boldsymbol{\beta}) \leq p^*$, where $p^*$ is the primal optimal value
- May be $-\infty$: at points where the Lagrangian is unbounded below in $\mathbf{x}$
Theorem (Weak Duality):
For any primal-feasible $\mathbf{x}$ and any dual-feasible $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ with $\boldsymbol{\alpha} \geq 0$:
$$d(\boldsymbol{\alpha}, \boldsymbol{\beta}) \leq f(\mathbf{x})$$
In particular, if $p^*$ is the optimal primal value and $d^*$ is the optimal dual value:
$$d^* \leq p^*$$
Proof:
Let $\mathbf{x}$ be primal-feasible: $g_i(\mathbf{x}) \leq 0$ and $h_j(\mathbf{x}) = 0$.
Since $\alpha_i \geq 0$ and $g_i(\mathbf{x}) \leq 0$, we have $\alpha_i g_i(\mathbf{x}) \leq 0$.
Since $h_j(\mathbf{x}) = 0$, we have $\beta_j h_j(\mathbf{x}) = 0$.
Therefore: $$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \underbrace{\sum_i \alpha_i g_i(\mathbf{x})}_{\leq 0} + \underbrace{\sum_j \beta_j h_j(\mathbf{x})}_{= 0} \leq f(\mathbf{x})$$
Taking the infimum over all $\mathbf{x}$ (not just feasible ones) only decreases the left side: $$d(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \leq \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \leq f(\mathbf{x})$$
Weak duality requires no assumptions about convexity—it's a universal truth for Lagrangian duality. Every solution to the dual provides a lower bound on the primal optimal value. This is incredibly useful: if you find primal and dual solutions with equal objectives, you've proven both are optimal!
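A quick numerical illustration of this bound, using the one-dimensional example from above ($\min x^2$ s.t. $x \geq 1$, so $p^* = 1$): every sampled dual value stays below every sampled feasible primal value.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: x**2              # primal objective
d = lambda a: -a**2 / 4 + a     # dual function derived earlier

xs = 1 + rng.exponential(size=1000)      # primal-feasible points (x >= 1)
alphas = rng.exponential(size=1000)      # dual-feasible multipliers (alpha >= 0)

# Weak duality: max over dual values <= min over feasible primal values
print(f"best dual bound   = {d(alphas).max():.4f}")   # approaches d* = 1 from below
print(f"best primal value = {f(xs).min():.4f}")       # approaches p* = 1 from above
assert d(alphas).max() <= f(xs).min()
```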
Applications of Weak Duality:
| Application | How Weak Duality Helps |
|---|---|
| Proving optimality | If $f(\mathbf{x}^*) = d(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$, both are optimal |
| Bounding search | Dual solutions bound how good primal can get |
| Early termination | If gap is small enough, stop optimization |
| Infeasibility detection | Dual unbounded → primal infeasible (under conditions) |
Since the dual function provides lower bounds, the best lower bound comes from maximizing it:
The Lagrangian Dual Problem:
$$\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} d(\boldsymbol{\alpha}, \boldsymbol{\beta}) \quad \text{subject to} \quad \boldsymbol{\alpha} \geq 0$$
Or equivalently:
$$\max_{\boldsymbol{\alpha} \geq 0, \boldsymbol{\beta}} \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
Key Observations:
- The dual problem is a convex optimization problem (maximizing a concave function over a convex set), even when the primal is non-convex
- The optimal value $d^*$ of the dual is always $\leq p^*$ by weak duality
- The difference $p^* - d^*$ is called the duality gap
The primal is min over x of the objective. The dual is max over multipliers of (inf over x of Lagrangian). This 'max-inf' structure creates a game-theoretic interpretation: the dual player (controlling α, β) tries to make the Lagrangian as large as possible, while x is chosen to minimize it.
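Because the dual is concave with only the simple constraint $\alpha \geq 0$, even plain projected gradient ascent solves it. A minimal sketch for the running example, where $d(\alpha) = -\alpha^2/4 + \alpha$ has maximizer $\alpha^* = 2$ (the step size 0.5 is an arbitrary illustrative choice):

```python
# Projected gradient ascent on d(a) = -a^2/4 + a over the set a >= 0
grad_d = lambda a: -a / 2 + 1          # d'(a)

alpha, step = 0.0, 0.5
for _ in range(100):
    alpha = max(0.0, alpha + step * grad_d(alpha))   # ascend, then project onto a >= 0

print(f"alpha* ≈ {alpha:.4f}")                       # converges to 2
print(f"d*     ≈ {-alpha**2 / 4 + alpha:.4f}")       # equals p* = 1 (strong duality)
```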
Primal-Dual Relationship:
| Primal Problem | Dual Problem |
|---|---|
| Minimize $f(\mathbf{x})$ | Maximize $d(\boldsymbol{\alpha}, \boldsymbol{\beta})$ |
| Subject to $g_i(\mathbf{x}) \leq 0$ | Subject to $\alpha_i \geq 0$ |
| Subject to $h_j(\mathbf{x}) = 0$ | $\beta_j$ unrestricted |
| Variables: $\mathbf{x}$ | Variables: $\boldsymbol{\alpha}, \boldsymbol{\beta}$ |
| May be non-convex | Always convex |
| Optimal value: $p^*$ | Optimal value: $d^* \leq p^*$ |
The dual offers a convex perspective on possibly non-convex problems. Even if we can't solve the primal directly, solving the dual gives us bounds.
The most desirable situation is when the duality gap vanishes:
Strong Duality:
$$p^* = d^*$$
When strong duality holds, solving the dual solves the primal: the optimal dual value equals the optimal primal value, the duality gap vanishes, and (for convex problems) the KKT conditions become both necessary and sufficient for optimality.
When Does Strong Duality Hold?
Slater's Condition (sufficient for strong duality):
If the primal problem is convex (convex $f$, convex $g_i$, affine $h_j$) and there exists a strictly feasible point $\bar{\mathbf{x}}$ with
$$g_i(\bar{\mathbf{x}}) < 0 \ \text{for all } i, \qquad h_j(\bar{\mathbf{x}}) = 0 \ \text{for all } j,$$
then strong duality holds.
Slater's condition requires just one strictly feasible point: not near the boundary, but strictly inside every inequality constraint. Constraint sets with non-empty interior (most practical cases) satisfy this. Slater fails only when the feasible region has no 'room inside,' which is pathological.
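As a concrete (if trivial) illustration for the running example with constraint $g(x) = 1 - x \leq 0$: exhibiting any single point with $g < 0$ certifies Slater's condition.

```python
# Slater check for the running example: g(x) = 1 - x
g = lambda x: 1 - x

x_bar = 2.0                                  # candidate strictly feasible point
print(f"g(x_bar) = {g(x_bar)}")              # -1.0 < 0, so x_bar is strictly feasible
print(f"Slater satisfied: {g(x_bar) < 0}")   # convex problem + Slater => strong duality
```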
| Problem Type | Duality Gap | Conditions |
|---|---|---|
| General non-convex | May be positive | Only weak duality guaranteed |
| Convex + Slater | Zero | Strong duality |
| Linear Programming | Zero | Always (when feasible) |
| Quadratic Programming | Zero | Always (when feasible, convex) |
| SVM (convex) | Zero | Slater holds for non-degenerate data |
The Saddle Point Interpretation:
When strong duality holds, the optimal primal and dual solutions form a saddle point of the Lagrangian:
$$\mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}, \boldsymbol{\beta}) \leq \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) \leq \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$$
for all feasible $\mathbf{x}$ and $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ with $\boldsymbol{\alpha} \geq 0$.
This saddle point is simultaneously a minimum of the Lagrangian over $\mathbf{x}$ (at fixed $\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$) and a maximum over $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ (at fixed $\mathbf{x}^*$).
Let's work through the process of deriving and solving a dual problem.
Systematic Approach:
1. Write the primal in standard form: min $f$ s.t. $g_i \leq 0$, $h_j = 0$.
2. Form the Lagrangian: $\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f + \sum \alpha_i g_i + \sum \beta_j h_j$.
3. Minimize over $\mathbf{x}$: find $\inf_{\mathbf{x}} \mathcal{L}$ (may require taking gradients and solving).
4. Express the result as $d(\boldsymbol{\alpha}, \boldsymbol{\beta})$, a function of the multipliers only.
5. State the dual problem: max $d$ s.t. $\boldsymbol{\alpha} \geq 0$.
6. Solve the dual (often easier; it is always convex).
The following example applies these six steps to a small quadratic program:

```python
import numpy as np

def dual_problem_derivation():
    """
    Example: derive and solve the dual of a simple QP.

    Primal:  min (1/2)x^2 + (1/2)y^2   s.t.  x + y >= 1
             (equivalently: 1 - x - y <= 0)

    Step 1: Lagrangian
        L(x, y, a) = (1/2)x^2 + (1/2)y^2 + a(1 - x - y),  a >= 0

    Step 2: Minimize over (x, y)
        dL/dx = x - a = 0  ->  x = a
        dL/dy = y - a = 0  ->  y = a

    Step 3: Substitute back to get the dual function
        d(a) = (1/2)a^2 + (1/2)a^2 + a(1 - 2a)
             = a^2 + a - 2a^2
             = -a^2 + a

    Step 4: Dual problem
        max d(a) = -a^2 + a   s.t.  a >= 0

    Step 5: Solve the dual
        d'(a) = -2a + 1 = 0  ->  a* = 1/2
        d''(a) = -2 < 0  (concave, so this is the maximum)

    Step 6: Recover the primal solution
        x* = y* = a* = 1/2
        p* = f(1/2, 1/2) = (1/2)(1/4) + (1/2)(1/4) = 1/4
        d* = d(1/2) = -(1/4) + 1/2 = 1/4
        Strong duality: p* = d* = 1/4
    """
    print("Dual Problem Derivation Example")
    print("=" * 60)
    print()
    print("Primal: min (1/2)x² + (1/2)y²  s.t.  x + y ≥ 1")
    print()

    # Dual function: d(α) = -α² + α
    def dual_function(alpha):
        return -alpha**2 + alpha

    # Optimal α from d'(α) = 0
    alpha_opt = 0.5

    # Primal solution from the stationarity conditions x = y = α
    x_opt = alpha_opt
    y_opt = alpha_opt

    # Optimal values
    primal_opt = 0.5 * x_opt**2 + 0.5 * y_opt**2
    dual_opt = dual_function(alpha_opt)

    print("Dual function: d(α) = -α² + α")
    print(f"Optimal dual variable: α* = {alpha_opt}")
    print(f"Dual optimal value: d* = {dual_opt}")
    print()
    print(f"Primal solution: x* = y* = {x_opt}")
    print(f"Primal optimal value: p* = {primal_opt}")
    print()
    print(f"Duality gap: p* - d* = {primal_opt - dual_opt}")
    print(f"Strong duality holds: {np.isclose(primal_opt, dual_opt)}")
    print()

    # Verify feasibility and complementary slackness
    print(f"Constraint check: x* + y* = {x_opt + y_opt} ≥ 1 ✓")
    print(f"Complementary slackness: α*(1-x*-y*) = {alpha_opt * (1 - x_opt - y_opt)} ✓")

    return x_opt, y_opt, alpha_opt

dual_problem_derivation()
```

If the primal and dual have the same optimal value (under strong duality), why bother with the dual? Several compelling reasons:
1. Computational Advantages
Sometimes the dual has fewer variables or simpler structure: it has one variable per primal constraint, its own constraints often reduce to simple bounds such as $\boldsymbol{\alpha} \geq 0$, and for separable problems it can decompose into independent subproblems.
2. The Kernel Trick Connection
For SVMs, the dual reveals that data only appears through inner products $\mathbf{x}_i^T \mathbf{x}_j$. This enables the kernel trick: replace inner products with kernel evaluations $k(\mathbf{x}_i, \mathbf{x}_j)$ to implicitly work in high-dimensional feature spaces.
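A minimal sketch of that substitution (assuming numpy; the RBF bandwidth $\gamma = 0.5$ is an arbitrary illustrative choice). The dual only ever touches the data through the $n \times n$ Gram matrix, so swapping in a kernel matrix leaves the optimization unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))      # 5 data points in 3 dimensions

# Linear Gram matrix: G[i, j] = <x_i, x_j>, all the dual ever sees of the data
G_linear = X @ X.T

# Kernel trick: replace inner products with k(x_i, x_j), here an RBF kernel
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
G_rbf = np.exp(-0.5 * sq_dists)  # gamma = 0.5

print(G_linear.shape, G_rbf.shape)   # both (5, 5): the dual problem is unchanged
```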
3. Structural Insights
Dual variables have natural interpretations: in economics, $\alpha_i$ is a shadow price measuring how much the optimal value improves when constraint $i$ is relaxed; in SVMs, $\alpha_i$ measures how strongly training point $i$ shapes the decision boundary.
If the primal has $n$ variables and $m$ constraints, the dual has $m$ variables and (often) $n$ constraints. When $m \ll n$ (many variables, few constraints), the dual is much smaller. This is exactly the situation in SVMs with many features but moderate data: the dual has one variable per data point.
| Situation | Prefer | Reason |
|---|---|---|
| Few variables, many constraints | Primal | Fewer variables in primal |
| Many variables, few constraints | Dual | Fewer variables in dual |
| Non-convex primal | Dual | Dual is always convex (gives bounds) |
| Need kernel trick | Dual | Data appears as inner products |
| Interior point method | Both | Naturally computes both |
If we solve the dual, how do we find the primal solution?
Method 1: Through the Lagrangian
If $(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$ is the dual optimal solution, the primal solution $\mathbf{x}^*$ minimizes $\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$.
For many problems (especially quadratic objectives), this minimizer can be expressed in closed form:
$$\mathbf{x}^* = \arg\min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$$
Method 2: KKT Stationarity
From $\nabla_{\mathbf{x}} \mathcal{L} = 0$, solve for $\mathbf{x}$ in terms of $(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$.
Example (SVM):
In SVMs, the optimal weight vector is recovered as: $$\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$$
Only support vectors ($\alpha_i > 0$) contribute to $\mathbf{w}^*$.
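A sketch of this recovery step on toy data; the dual variables `alpha` below are made-up values standing in for the output of an SVM dual solver:

```python
import numpy as np

# Toy labeled data with y in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [-0.5, 1.0]])
y = np.array([1, 1, -1, -1])

# Hypothetical dual solution (in practice, produced by solving the SVM dual)
alpha = np.array([0.4, 0.0, 0.4, 0.0])

# Primal recovery: w* = sum_i alpha_i * y_i * x_i; only alpha_i > 0 contribute
w = (alpha * y) @ X
print("w* =", w)
print("support vectors:", np.where(alpha > 0)[0])
```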
When strong duality holds, the optimal value is unique, but optimal solutions may not be: multiple $\mathbf{x}^*$ can achieve the same $f^*$. Solving the dual identifies the optimal value and (typically) pins $\mathbf{x}^*$ down to a specific solution or set of solutions.
The Recovery Process:
1. Solve the dual to get $(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$.
2. Substitute into the stationarity condition: $\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = 0$.
3. Solve for $\mathbf{x}^*$, often a linear system for quadratic objectives.
4. Verify feasibility: ensure $\mathbf{x}^*$ satisfies the primal constraints (this should be automatic if done correctly).
This recovery process is central to algorithms like SMO (Sequential Minimal Optimization) for SVMs.
Duality has a beautiful geometric interpretation in terms of supporting hyperplanes.
The G-Graph:
Consider the set of achievable (constraint value, objective value) pairs:
$$\mathcal{G} = \{(g(\mathbf{x}), f(\mathbf{x})) : \mathbf{x} \in \mathbb{R}^n\} \subset \mathbb{R}^{m+1}$$
The primal asks: what's the lowest objective value ($f$) we can achieve when all constraint values ($g_i$) are non-positive?
Geometrically: project $\mathcal{G}$ onto the "feasible" region where $g_i \leq 0$, and find the minimum $f$ coordinate.
Dual as Supporting Hyperplane:
The dual finds a hyperplane (parameterized by $\boldsymbol{\alpha}$) that lies on or below every point of $\mathcal{G}$, supporting the set from below, with its slope in the constraint coordinates fixed by $-\boldsymbol{\alpha}$; among all such supporting hyperplanes, the dual selects the one crossing the objective axis as high as possible.
The intercept of this optimal supporting hyperplane with the objective axis equals the dual optimal value.
The duality gap corresponds to points of G that 'bulge below' the supporting hyperplane. For convex problems, G is convex, so supporting hyperplanes touch the boundary—no gap. For non-convex G, the hyperplane might not touch, creating a gap.
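To make this concrete for the running one-dimensional example ($f(x) = x^2$, $g(x) = 1 - x$, dual optimum $\alpha^* = 2$, $d^* = 1$), the sketch below samples $\mathcal{G}$ and verifies that the line $t = d^* - \alpha^* u$ supports it from below, touching exactly at the optimum:

```python
import numpy as np

xs = np.linspace(-5, 5, 1001)
u = 1 - xs            # constraint coordinate g(x)
t = xs**2             # objective coordinate f(x)

alpha_star, d_star = 2.0, 1.0

# Supporting hyperplane in the (u, t) plane: t = d* - alpha* * u
line = d_star - alpha_star * u
print("G lies above the hyperplane:", bool(np.all(t >= line - 1e-12)))
print("minimum slack:", (t - line).min())   # 0 at x* = 1, where the plane touches G
```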
Convexity Closes the Gap:
When the primal is convex, the region on or above $\mathcal{G}$ is convex, so the optimal supporting hyperplane actually touches the boundary at the optimum; combined with a constraint qualification such as Slater's condition, this forces $p^* = d^*$.
This geometric view makes strong duality intuitive: for convex problems, you can "separate" the feasible region from strictly better objectives using a hyperplane, and this hyperplane corresponds to the dual solution.
Duality provides a profound alternative view of optimization. To consolidate: the dual function is always concave and always lower-bounds the primal optimum (weak duality); for convex problems satisfying Slater's condition the bound is tight (strong duality); and a dual solution both certifies optimality and, via the KKT stationarity condition, lets us recover the primal solution.
What's Next:
We've developed the theory of duality. The next page shows how to explicitly construct the dual problem for important cases, focusing on the dual problem formulation techniques that lead directly to practical algorithms like the SVM dual.
You now understand Lagrangian duality: the dual function, weak and strong duality, conditions for zero duality gap, and the geometric interpretation. This forms the theoretical foundation for understanding how dual formulations reveal problem structure and enable powerful algorithms like kernel SVMs.