Imagine you're designing a neural network and you want to find the weights that minimize the loss function. Simple enough—use gradient descent. But what if you also need the weights to satisfy certain constraints? Perhaps you require the weight vectors to have unit norm (for numerical stability), or the total magnitude of weights to be bounded (for regularization), or certain weights to sum to exactly one (for attention mechanisms).
This is constrained optimization: finding the best solution not in all of space, but within a restricted region defined by constraints. It's the mathematical foundation for Support Vector Machines, portfolio optimization, resource allocation, and countless other problems where you can't simply search everywhere—you must search smartly within boundaries.
By the end of this page, you will understand Lagrange multipliers—one of the most elegant and powerful tools in optimization. You'll grasp not just the mechanics but the deep geometric intuition: why the method works, when it succeeds, and how it connects to the broader landscape of machine learning optimization.
Let's formalize what we mean by constrained optimization. Consider the following general formulation:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) = 0, \quad i = 1, \ldots, m$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is the objective function to minimize, $\mathbf{x} \in \mathbb{R}^n$ is the vector of decision variables, and each $g_i: \mathbb{R}^n \to \mathbb{R}$ defines an equality constraint that the solution must satisfy.
The constraint $g_i(\mathbf{x}) = 0$ defines a level set—a surface (or manifold) in $\mathbb{R}^n$ where the function $g_i$ equals zero. When you have $m$ constraints, you're looking for solutions that lie on the intersection of $m$ such surfaces.
Key insight: Without constraints, we'd simply find where $\nabla f = 0$ (critical points). With constraints, the optimal point might not be a critical point of $f$ in the traditional sense—the gradient of $f$ need not vanish. Instead, we need a new condition that accounts for being restricted to the constraint surface.
A classic example: maximize the area $f(x, y) = xy$ of a rectangle subject to the perimeter constraint $g(x, y) = 2x + 2y - 100 = 0$. This seemingly simple problem becomes trivial with Lagrange multipliers. The constraint restricts us to a line in the $(x, y)$ plane, and we seek the point on this line where $xy$ is maximized. We'll solve this shortly and find $x = y = 25$.
For simple constraints like 2x + 2y = 100, you could solve for y = 50 - x and substitute into f(x,y) = x(50-x), reducing to a single-variable problem. But this approach fails when: (1) constraints can't be solved explicitly, (2) you have multiple constraints, (3) constraints are nonlinear and coupled, or (4) you need to understand the structure of the solution. Lagrange multipliers handle all these cases elegantly.
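For this particular problem the substitution route is easy to carry out. Here is a quick sketch (assuming numpy is available) just to confirm the answer we will re-derive with Lagrange multipliers:

```python
import numpy as np

# Substitution approach for the rectangle problem:
# the constraint 2x + 2y = 100 gives y = 50 - x, so maximize f(x) = x * (50 - x).
x = np.linspace(0.0, 50.0, 501)
area = x * (50.0 - x)

best = np.argmax(area)
print(x[best], 50.0 - x[best], area[best])   # 25.0, 25.0, 625.0
# Setting the derivative 50 - 2x = 0 gives the same answer analytically: x = 25.
```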
The power of Lagrange multipliers comes from a beautiful geometric insight. Let's build this intuition step by step.
Visualization Setup:
Consider minimizing $f(x, y)$ subject to $g(x, y) = 0$. Imagine the level curves of $f$ drawn as a contour map of the plane, with the constraint $g(x, y) = 0$ traced as a single curve across it.
As you move along the constraint curve, the objective function $f$ changes value. At the optimal point, something special happens: you can't improve $f$ by moving along the constraint.
The Tangency Condition:
At the optimum, the level curve of $f$ is tangent to the constraint curve $g = 0$. Why? If the curves crossed transversally (at an angle), you could slide along the constraint and either increase or decrease $f$—meaning you're not at an optimum.
Mathematically, tangency means the curves have the same tangent line. Equivalently, their normal vectors are parallel.
Recall that $\nabla f$ points in the direction of steepest increase of $f$, and it's perpendicular to the level curves of $f$. Similarly, $\nabla g$ is perpendicular to the level curves of $g$. At a point where the level curve of $f$ is tangent to $g = 0$ (which is a level curve of $g$), the two normal vectors must be parallel.
The Core Condition:
Two vectors being parallel means one is a scalar multiple of the other:
$$\nabla f(\mathbf{x}^*) = \lambda \nabla g(\mathbf{x}^*)$$
for some scalar $\lambda$. This scalar is the Lagrange multiplier.
Combined with the constraint $g(\mathbf{x}^*) = 0$, we have a system of equations:
$$\begin{cases} \nabla f(\mathbf{x}^*) = \lambda \nabla g(\mathbf{x}^*) \\ g(\mathbf{x}^*) = 0 \end{cases}$$
This gives us $n + 1$ equations (one for each component of the gradient equality, plus the constraint) in $n + 1$ unknowns ($n$ components of $\mathbf{x}^*$ and $\lambda$). In principle, this is solvable!
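In practice, the square system can be handed straight to a numerical root finder. A minimal sketch (assuming scipy is available) for minimizing $f(x, y) = x + y$ on the unit circle $x^2 + y^2 = 1$, where the solution is $x = y = -1/\sqrt{2}$ with $\lambda = -1/\sqrt{2}$:

```python
import numpy as np
from scipy.optimize import fsolve

def lagrange_system(vars):
    """The n + 1 equations: grad f = lambda * grad g, together with g = 0."""
    x, y, lam = vars
    # f(x, y) = x + y,  g(x, y) = x**2 + y**2 - 1
    return [
        1 - lam * 2 * x,        # df/dx - lambda * dg/dx = 0
        1 - lam * 2 * y,        # df/dy - lambda * dg/dy = 0
        x**2 + y**2 - 1,        # g(x, y) = 0
    ]

# Starting near the minimizer; a different start may find the maximizer instead
x_opt, y_opt, lam_opt = fsolve(lagrange_system, x0=[-1.0, -1.0, -1.0])
print(x_opt, y_opt, lam_opt)   # approx -0.7071, -0.7071, -0.7071
```

Like any root finder, fsolve only returns whichever stationary point the starting guess leads to; here the start near $(-1, -1)$ picks out the minimizer rather than the maximizer at $(1/\sqrt{2}, 1/\sqrt{2})$.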
| Mathematical Condition | Geometric Meaning | Intuition |
|---|---|---|
| $\nabla f = \lambda \nabla g$ | Gradients are parallel | Level curves are tangent |
| $g(\mathbf{x}^*) = 0$ | Point lies on constraint | Solution must be feasible |
| $\nabla f \neq 0$ | $f$ is not at a critical point | Constraint actively restricts |
| $\lambda > 0$ | $\nabla f$ and $\nabla g$ point in the same direction | Constraint prevents further decrease of $f$ |
We can encode the necessary conditions into a single function called the Lagrangian:
$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda\, g(\mathbf{x})$$
Note: Some authors use $+\lambda$ instead of $-\lambda$; the convention doesn't matter as long as you're consistent.
Why This Works:
Taking the gradient with respect to $\mathbf{x}$:
$$\nabla_{\mathbf{x}} \mathcal{L} = \nabla f - \lambda \nabla g$$
Setting this to zero gives us:
$$\nabla f = \lambda \nabla g \quad \checkmark$$
Taking the derivative with respect to $\lambda$:
$$\frac{\partial \mathcal{L}}{\partial \lambda} = -g(\mathbf{x})$$
Setting this to zero gives us:
$$g(\mathbf{x}) = 0 \quad \checkmark$$
So the critical points of the Lagrangian (where all partial derivatives vanish) are exactly the points satisfying our necessary conditions for a constrained optimum!
This is the key insight: The Lagrangian converts a constrained optimization problem into an unconstrained one (sort of). Instead of minimizing $f$ subject to $g=0$, we find critical points of $\mathcal{L}$ over both $\mathbf{x}$ and $\lambda$. The constraint is 'absorbed' into the objective via the Lagrange multiplier.
For Multiple Constraints:
With $m$ equality constraints $g_1(\mathbf{x}) = 0, \ldots, g_m(\mathbf{x}) = 0$, we introduce $m$ Lagrange multipliers $\lambda_1, \ldots, \lambda_m$:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x})$$
The necessary conditions become:
$$\nabla f(\mathbf{x}) = \sum_{i=1}^{m} \lambda_i \nabla g_i(\mathbf{x})$$
Geometrically, this says the gradient of $f$ lies in the span of the constraint gradients. This makes sense: the directions that keep us feasible are exactly those orthogonal to every constraint gradient, and at an optimum $\nabla f$ can have no component along any feasible direction, so it must lie entirely in the span of the $\nabla g_i$.
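To see this numerically, here is a small sketch (the particular objective, constraints, and use of scipy are illustrative assumptions): minimize $\|\mathbf{x}\|^2$ in $\mathbb{R}^3$ under two linear constraints, then check that $\nabla f$ at the solution is a linear combination of the constraint gradients.

```python
import numpy as np
from scipy.optimize import minimize

# Minimize ||x||^2 subject to two linear constraints in R^3:
#   g1(x) = x1 + x2 + x3 - 1 = 0
#   g2(x) = x1 - x3 = 0
f = lambda x: x @ x
grad_f = lambda x: 2 * x
g1 = {'type': 'eq', 'fun': lambda x: x.sum() - 1}
g2 = {'type': 'eq', 'fun': lambda x: x[0] - x[2]}

res = minimize(f, x0=np.array([1.0, 0.0, 0.0]), constraints=[g1, g2], method='SLSQP')
x_star = res.x                       # expected approx [1/3, 1/3, 1/3]

# Constraint gradients (constant here because the constraints are linear)
G = np.array([[1.0, 1.0,  1.0],      # grad g1
              [1.0, 0.0, -1.0]])     # grad g2

# Solve grad f = lambda1 * grad g1 + lambda2 * grad g2 in the least-squares sense
lambdas = np.linalg.lstsq(G.T, grad_f(x_star), rcond=None)[0]
print(x_star)                        # ~ [0.333, 0.333, 0.333]
print(lambdas)                       # ~ [0.667, 0.0]
print(np.linalg.norm(G.T @ lambdas - grad_f(x_star)))   # ~ 0: grad f lies in the span
```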
```python
import numpy as np
from scipy.optimize import minimize

def lagrangian_method_example():
    """
    Example: Minimize f(x,y) = x² + y²
    subject to g(x,y) = x + y - 1 = 0

    The Lagrangian: L(x,y,λ) = x² + y² - λ(x + y - 1)

    Setting gradients to zero:
    ∂L/∂x = 2x - λ = 0  →  x = λ/2
    ∂L/∂y = 2y - λ = 0  →  y = λ/2
    ∂L/∂λ = -(x + y - 1) = 0  →  x + y = 1

    Substituting: λ/2 + λ/2 = 1  →  λ = 1
    Therefore: x = y = 1/2
    """
    # Analytical solution
    x_opt, y_opt = 0.5, 0.5
    lambda_opt = 1.0

    print("Analytical Solution (Lagrange Multipliers):")
    print(f"  Optimal point: ({x_opt}, {y_opt})")
    print(f"  Optimal value: f(x,y) = {x_opt**2 + y_opt**2}")
    print(f"  Lagrange multiplier: λ = {lambda_opt}")

    # Verify with constraint
    print(f"  Constraint check: x + y - 1 = {x_opt + y_opt - 1}")

    # Numerical verification using scipy
    def objective(xy):
        return xy[0]**2 + xy[1]**2

    def constraint(xy):
        return xy[0] + xy[1] - 1

    result = minimize(
        objective,
        x0=[0, 0],  # starting point
        constraints={'type': 'eq', 'fun': constraint},
        method='SLSQP'
    )

    print("\nNumerical Verification (scipy.optimize):")
    print(f"  Optimal point: ({result.x[0]:.6f}, {result.x[1]:.6f})")
    print(f"  Optimal value: {result.fun:.6f}")

    return x_opt, y_opt, lambda_opt

lagrangian_method_example()
```

The Lagrange multiplier $\lambda$ isn't just a mathematical artifact—it has profound meaning. It measures the sensitivity of the optimal objective value to changes in the constraint.
The Envelope Theorem Perspective:
Consider the parametrized problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g(\mathbf{x}) = c$$
Let $V(c) = f(\mathbf{x}^*(c))$ be the optimal objective value as a function of the constraint parameter $c$. Then:
$$\frac{dV}{dc} = \lambda^*$$
In words: The Lagrange multiplier tells you how much the optimal objective would improve (or worsen) if you relaxed the constraint slightly.
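A quick numerical check of this on $\min\, x^2 + y^2$ subject to $x + y = c$ (a sketch assuming scipy; analytically $V(c) = c^2/2$, so $dV/dc = c$, and stationarity $2x = \lambda$ with $x = c/2$ gives $\lambda^* = c$):

```python
import numpy as np
from scipy.optimize import minimize

def optimal_value(c):
    """V(c) = min x^2 + y^2 subject to x + y = c (analytically V(c) = c^2 / 2)."""
    res = minimize(lambda v: v[0]**2 + v[1]**2,
                   x0=[0.0, 0.0],
                   constraints={'type': 'eq', 'fun': lambda v: v[0] + v[1] - c},
                   method='SLSQP')
    return res.fun

c = 1.0
lambda_star = c              # from the analytical solution above

# Finite-difference estimate of dV/dc
eps = 1e-4
dV_dc = (optimal_value(c + eps) - optimal_value(c - eps)) / (2 * eps)

print(dV_dc, lambda_star)    # both approximately 1.0: dV/dc matches the multiplier
```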
If you have 100 units of a resource and $\lambda^* = 5$ at the optimum, getting 1 more unit of the resource would increase the optimal profit by approximately $5. This is why Lagrange multipliers are called 'shadow prices' in economics—they represent the marginal value of relaxing each constraint. In ML, they tell you how much your loss function could improve if constraints were loosened.
With the Lagrangian $\mathcal{L} = f - \lambda g$ and the perturbed constraint $g(\mathbf{x}) = c$, the envelope theorem above gives $dV/dc = \lambda^*$. For a minimization, a positive $\lambda$ means that pushing $g$ slightly positive would raise the optimal value, while a negative $\lambda$ means that relaxing the constraint in that direction would lower it. Either way, the magnitude of the multiplier quantifies the 'cost' of holding the constraint at equality.
ML Applications of Multiplier Interpretation:
| Context | Constraint | λ Interpretation |
|---|---|---|
| SVM margin maximization | Data point on margin | Importance of support vector |
| Portfolio optimization | Budget constraint | Marginal value of additional capital |
| Neural network pruning | Sparsity constraint | Cost of adding more parameters |
| Fairness in ML | Demographic parity | Trade-off between accuracy and fairness |
Understanding $\lambda$ helps you reason about how binding each constraint is, which constraints would be most valuable to relax, and how the optimum trades off the objective against the restrictions imposed on it.
Let's work through the complete methodology for solving constrained optimization problems using Lagrange multipliers.
Step-by-Step Procedure (a symbolic sketch of the first four steps follows the list):
Formulate the Lagrangian: $\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_i \lambda_i g_i(\mathbf{x})$
Compute all partial derivatives: Find $\nabla_{\mathbf{x}} \mathcal{L}$ and $\nabla_{\boldsymbol{\lambda}} \mathcal{L}$
Set derivatives to zero: This gives you a system of equations
Solve the system: Find all candidate solutions $(\mathbf{x}^*, \boldsymbol{\lambda}^*)$
Verify second-order conditions: Confirm whether candidates are minima, maxima, or saddle points
Evaluate objective: Compare objective values at all candidates to find global optimum
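Steps 1 through 4 can be automated symbolically for small problems. A minimal sketch (assuming sympy is available; step 5, second-order verification, is omitted):

```python
import sympy as sp

def solve_equality_constrained(f, constraints, variables):
    """Steps 1-4 of the procedure: build the Lagrangian, differentiate,
    set all partial derivatives to zero, and solve the system symbolically."""
    multipliers = sp.symbols(f'lam0:{len(constraints)}')
    L = f - sum(lam * g for lam, g in zip(multipliers, constraints))
    unknowns = list(variables) + list(multipliers)
    stationarity = [sp.diff(L, u) for u in unknowns]
    return sp.solve(stationarity, unknowns, dict=True)

# Example: minimize x^2 + y^2 subject to x + y = 1 (matches the worked example above)
x, y = sp.symbols('x y', real=True)
candidates = solve_equality_constrained(x**2 + y**2, [x + y - 1], [x, y])
print(candidates)   # -> [{x: 1/2, y: 1/2, lam0: 1}]
```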
For example, consider maximizing the entropy of a probability distribution whose entries must sum to 1: $p_1 + p_2 + p_3 = 1$. The maximum entropy distribution is uniform: $p_1 = p_2 = p_3 = 1/3$. This is the principle of maximum entropy—among all distributions satisfying given constraints, the most 'random' (highest entropy) is often the most appropriate if we have no other information.
```python
import numpy as np
from scipy.optimize import minimize

def maximum_entropy_example():
    """
    Maximize H(p) = -Σ pᵢ log(pᵢ)
    subject to Σ pᵢ = 1

    Lagrangian: L = -Σ pᵢ log(pᵢ) - λ(Σ pᵢ - 1)

    Setting ∂L/∂pᵢ = 0:
    -log(pᵢ) - 1 - λ = 0
    log(pᵢ) = -(1 + λ)
    pᵢ = exp(-(1 + λ))

    All pᵢ are equal! Since Σ pᵢ = 1 and we have n components:
    n * exp(-(1 + λ)) = 1
    pᵢ = 1/n for all i

    This proves the uniform distribution maximizes entropy.
    """
    n = 3  # number of outcomes

    # Analytical solution
    p_uniform = np.array([1/n] * n)
    entropy_max = -np.sum(p_uniform * np.log(p_uniform))

    print("Maximum Entropy Problem (3 outcomes)")
    print("=" * 50)
    print(f"\nAnalytical Solution:")
    print(f"  Optimal distribution: {p_uniform}")
    print(f"  Maximum entropy: {entropy_max:.6f} nats")
    print(f"  In bits: {entropy_max / np.log(2):.6f} bits")

    # Compute λ from the solution
    # From pᵢ = exp(-(1 + λ)) = 1/n
    # We get -(1 + λ) = log(1/n) = -log(n)
    # Therefore λ = log(n) - 1
    lambda_opt = np.log(n) - 1
    print(f"  Lagrange multiplier λ = {lambda_opt:.6f}")

    # Verify: compare with non-uniform distributions
    print(f"\nComparison with other distributions:")

    p_skewed = np.array([0.7, 0.2, 0.1])
    entropy_skewed = -np.sum(p_skewed * np.log(p_skewed))
    print(f"  Skewed [0.7, 0.2, 0.1]: H = {entropy_skewed:.6f} nats")

    p_peaked = np.array([0.99, 0.005, 0.005])
    entropy_peaked = -np.sum(p_peaked * np.log(p_peaked))
    print(f"  Peaked [0.99, 0.005, 0.005]: H = {entropy_peaked:.6f} nats")

    print(f"\n  Uniform distribution has MAXIMUM entropy ✓")

    return p_uniform, entropy_max

maximum_entropy_example()
```

Common Pitfalls:
Forgetting to check constraint qualification: If $\nabla g_i$ are linearly dependent at the solution, Lagrange conditions may fail (see next section)
Finding only critical points, not optima: The method finds necessary conditions; second-order tests or boundary analysis may be needed
Sign errors: Be consistent with your Lagrangian formulation ($f + \lambda g$ vs $f - \lambda g$)
Missing solutions: Multiple local optima may exist; the method finds candidates, not necessarily the global optimum
Lagrange multipliers are powerful, but they require a technical condition called constraint qualification to guarantee that the necessary conditions are actually necessary.
Linear Independence Constraint Qualification (LICQ):
At any feasible point $\mathbf{x}$, the gradients of active constraints $\{\nabla g_i(\mathbf{x}) : g_i(\mathbf{x}) = 0\}$ must be linearly independent.
Why This Matters:
If constraint gradients are parallel or zero at the optimum, the Lagrangian setup can fail. Consider:
$$\min \; x^2 + y^2 \quad \text{subject to} \quad (x-1)^3 - y^2 = 0$$
The feasible set requires $x \geq 1$ (since $y^2 = (x-1)^3 \geq 0$), so the constrained minimum of $x^2 + y^2$ is at $(1, 0)$. But there, $\nabla g = (3(x-1)^2, -2y) = (0, 0)$: the constraint gradient vanishes, LICQ fails, and no scalar $\lambda$ can satisfy $\nabla f = \lambda \nabla g$, because $\nabla f = (2, 0) \neq \mathbf{0}$ while $\lambda \nabla g = \mathbf{0}$.
The constrained optimum is perfectly real; the Lagrange conditions simply cannot detect it. This is what a failure of constraint qualification looks like in practice.
LICQ fails when: (1) Two constraint gradients are parallel at the solution, (2) A constraint gradient is zero at the solution (degenerate constraint), (3) You have more active constraints than dimensions. In practice, most 'reasonable' problems satisfy LICQ, but it's important to be aware of these edge cases.
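A tiny numeric check of the cusp example above makes the failure concrete (a sketch): at the constrained minimizer $(1, 0)$ the constraint gradient vanishes while $\nabla f$ does not, so no $\lambda$ can work.

```python
import numpy as np

# Cusp example: f(x, y) = x^2 + y^2,  g(x, y) = (x - 1)**3 - y**2 = 0
grad_f = lambda x, y: np.array([2 * x, 2 * y])
grad_g = lambda x, y: np.array([3 * (x - 1)**2, -2 * y])

x_star, y_star = 1.0, 0.0          # constrained minimizer (feasible set forces x >= 1)
print(grad_f(x_star, y_star))      # [2. 0.]  -- nonzero
print(grad_g(x_star, y_star))      # [0. 0.]  -- zero: LICQ fails
# No lambda satisfies grad_f = lambda * grad_g, yet (1, 0) is the optimum.
```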
Other Constraint Qualifications:
Beyond LICQ, there are weaker conditions that still ensure Lagrange conditions are necessary:
| Constraint Qualification | Description |
|---|---|
| LICQ | Active constraint gradients are linearly independent |
| MFCQ (Mangasarian-Fromovitz) | Weaker than LICQ; allows redundant constraints |
| Slater's Condition | For convex problems: a strictly feasible point exists |
| Linear constraints | Always satisfy CQ if feasible region is non-empty |
For most ML applications—especially when constraints are linear (like $\mathbf{w}^T \mathbf{w} \leq C$ or $\sum_i \alpha_i = 0$)—constraint qualification holds automatically.
The Lagrangian conditions $\nabla_{\mathbf{x}} \mathcal{L} = 0$ and $g(\mathbf{x}) = 0$ are necessary but not sufficient. A critical point of the Lagrangian could be a minimum, maximum, or saddle point of the original problem. We need second-order conditions to distinguish.
The Bordered Hessian:
For constrained optimization, we examine the Hessian of the Lagrangian restricted to the constraint surface:
$$H_\mathcal{L} = \nabla^2_{\mathbf{x}} \mathcal{L} = \nabla^2 f - \sum_i \lambda_i \nabla^2 g_i$$
Second-Order Necessary Conditions (for minimum):
For any vector $\mathbf{v}$ tangent to the constraint surface (i.e., $\nabla g_i \cdot \mathbf{v} = 0$ for all $i$):
$$\mathbf{v}^T H_\mathcal{L} \mathbf{v} \geq 0$$
Second-Order Sufficient Conditions:
If the above inequality is strict ($> 0$) for all nonzero tangent vectors, then the critical point is a strict local minimum.
If $f$ is convex and the constraints $g_i$ are affine (linear), then every critical point of the Lagrangian is a global minimum of the constrained problem. No second-order checking needed! This is why convex optimization is so attractive—Lagrangian conditions are both necessary and sufficient.
Practical Verification:
For single-constraint problems in two variables, the bordered Hessian test involves a 3×3 determinant:
$$\bar{H} = \begin{pmatrix} 0 & g_x & g_y \\ g_x & \mathcal{L}_{xx} & \mathcal{L}_{xy} \\ g_y & \mathcal{L}_{xy} & \mathcal{L}_{yy} \end{pmatrix}$$
If $\det(\bar{H}) > 0$ at a critical point, it's a local maximum. If $\det(\bar{H}) < 0$, it's a local minimum.
For higher dimensions and multiple constraints, the bordered Hessian generalizes but becomes more complex. In practice, for convex problems or when numerical methods are used, these tests are often bypassed by the inherent structure of the algorithm.
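As a concrete instance of the two-variable, one-constraint rule above, here is the bordered Hessian test applied to the rectangle problem from earlier (a sketch assuming sympy is available):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x * y                              # objective (area)
g = 2 * x + 2 * y - 100                # constraint (perimeter)
L = f - lam * g

# Critical point of the Lagrangian: x = y = 25, lambda = 12.5
point = {x: 25, y: 25, lam: sp.Rational(25, 2)}

# Bordered Hessian for one constraint in two variables
H_bar = sp.Matrix([
    [0,             sp.diff(g, x),     sp.diff(g, y)],
    [sp.diff(g, x), sp.diff(L, x, x),  sp.diff(L, x, y)],
    [sp.diff(g, y), sp.diff(L, x, y),  sp.diff(L, y, y)],
])

print(H_bar.subs(point))          # Matrix([[0, 2, 2], [2, 0, 1], [2, 1, 0]])
print(H_bar.subs(point).det())    # 8 > 0  ->  local maximum, as expected for the area
```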
Lagrange multipliers appear throughout machine learning, often in ways that aren't immediately obvious. Here's a preview of key applications we'll develop in this module:
1. Support Vector Machines (Hard Margin)
The primal SVM problem:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$$
The Lagrange multipliers $\alpha_i$ become the coefficients of support vectors—the only data points that influence the decision boundary. Points with $\alpha_i > 0$ are support vectors; points with $\alpha_i = 0$ don't affect the solution.
2. Principal Component Analysis (PCA)
PCA maximizes variance subject to orthonormality constraints:
$$\max_{\mathbf{w}} \mathbf{w}^T \Sigma \mathbf{w} \quad \text{subject to} \quad \|\mathbf{w}\|^2 = 1$$
The Lagrange multipliers turn out to be the eigenvalues of the covariance matrix!
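Here's a quick numeric check of that claim (a sketch; the random covariance matrix, the seed, and the use of scipy's SLSQP solver are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
Sigma = A.T @ A                 # a random 3x3 positive semidefinite "covariance" matrix

# Maximize w^T Sigma w subject to ||w||^2 = 1 (minimize the negative)
res = minimize(lambda w: -w @ Sigma @ w,
               x0=np.array([1.0, 0.0, 0.0]),
               constraints={'type': 'eq', 'fun': lambda w: w @ w - 1},
               method='SLSQP')
w_star = res.x
variance = w_star @ Sigma @ w_star

eigvals, eigvecs = np.linalg.eigh(Sigma)
print(variance, eigvals[-1])    # the achieved variance should match the largest eigenvalue
# The Lagrange condition Sigma w = lambda w says w* is an eigenvector and
# lambda is the corresponding eigenvalue.
```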
| Application | Constraint | Role of λ |
|---|---|---|
| SVM | $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ | Support vector weights |
| PCA | $\|\mathbf{w}\| = 1$ | Eigenvalue (variance explained) |
| Maximum Entropy | $\sum p_i = 1$ | Normalization constant |
| Regularized regression | $\|\mathbf{w}\|^2 \leq t$ | Regularization strength |
| Attention mechanisms | $\sum a_i = 1$ | Temperature parameter (softmax) |
In many ML applications, we don't solve the primal Lagrangian conditions directly. Instead, we construct the 'dual problem' that optimizes over the Lagrange multipliers themselves. This dual perspective often reveals elegant structure: the kernel trick in SVMs, efficient algorithms, and insights into model behavior. We'll explore duality in depth later in this module.
We've covered the fundamental theory of Lagrange multipliers. The key insights to consolidate: at a constrained optimum the level sets of $f$ are tangent to the constraint surface, so $\nabla f$ is a linear combination of the constraint gradients; the Lagrangian packages this condition and feasibility into the stationary points of a single function; the multiplier measures the sensitivity of the optimal value to perturbing the constraint; and constraint qualification plus second-order conditions are what turn candidate critical points into certified optima.
What's Next:
Lagrange multipliers handle equality constraints elegantly. But what about inequality constraints like $g(\mathbf{x}) \leq 0$? The next page introduces the Karush-Kuhn-Tucker (KKT) conditions—the generalization of Lagrange multipliers to inequality-constrained problems. These are fundamental to understanding SVMs, constrained regression, and modern optimization algorithms.
You now understand Lagrange multipliers for equality-constrained optimization. You can formulate Lagrangians, solve the resulting systems, interpret the multipliers, and recognize their role in ML. Next, we extend this framework to handle inequality constraints through KKT conditions.