We've built a rigorous foundation: convex sets, convex functions, first and second-order optimality conditions, and the guarantee that local minima are global minima. Now comes the crucial question: Which machine learning problems are actually convex?
The answer shapes the reliability of ML systems. For convex problems, we have mathematical guarantees: gradient descent will find the global optimum, the solution is unique (for strictly convex problems), and convergence rates are well understood. For non-convex problems, we have heuristics, hopes, and empirical observation—but no guarantees.
This page maps the ML landscape through the lens of convexity. We'll identify which classic algorithms enjoy convex formulations, understand why neural networks are fundamentally non-convex, and explore techniques for either enforcing convexity or dealing gracefully with its absence.
By the end of this page, you will understand: (1) which ML algorithms have convex objectives, (2) why neural networks break convexity, (3) convex relaxations and surrogate losses, (4) when to insist on convexity vs. accept non-convexity, and (5) practical strategies for convex ML design.
Let's systematically categorize major ML algorithms by their convexity properties. This taxonomy is essential for understanding when we can trust our training procedures.
| Algorithm | Convexity | Loss Function | Notes |
|---|---|---|---|
| Linear Regression (OLS) | ✓ Convex (strongly if $X$ has full column rank) | Squared error | Closed-form solution exists |
| Ridge Regression | ✓ Convex (strongly) | Squared + L2 penalty | $\lambda$-strongly convex |
| Lasso Regression | ✓ Convex | Squared + L1 penalty | Not strictly convex; sparse solutions |
| Elastic Net | ✓ Convex (strongly) | Squared + L1 + L2 | Best of both worlds |
| Logistic Regression | ✓ Convex (strongly with L2) | Log loss | Strictly convex for full-rank features; strongly convex with regularization |
| Softmax Regression | ✓ Convex | Cross-entropy | Convex (not strictly) in logits: invariant to shifting all logits |
| Linear SVM | ✓ Convex (strongly with L2) | Hinge loss + regularization | Quadratic program |
| Kernel SVM | ✓ Convex | Hinge + kernel regularizer | Convex in dual formulation |
| Naive Bayes | — N/A | Generative model | No optimization; closed-form MLE |
| k-Means Clustering | ✗ Non-convex | Sum of squared distances | Discrete assignments break convexity |
| Gaussian Mixture Models | ✗ Non-convex | Log-likelihood | Multi-modal likelihood surface |
| Matrix Factorization | ✗ Non-convex | Squared error in UV^T | Bilinear form is non-convex |
| Neural Networks | ✗ Non-convex | Various | Composition of nonlinearities |
| Decision Trees | — N/A | Various impurity measures | Discrete structure; not continuous optimization |
Notice the pattern: linear models with convex losses are convex. Adding L2 regularization makes them strongly convex. The moment we introduce nonlinear feature learning (neural networks), latent variables (mixtures, factorizations), or discrete decisions (clustering, trees), convexity typically breaks.
Understanding why certain models are convex builds intuition for designing new convex formulations.
Linear Regression:
Objective: $f(w) = \frac{1}{2n} \|Xw - y\|^2 = \frac{1}{2n} (w^T X^T X w - 2w^T X^T y + y^T y)$
This is a quadratic function of $w$ with Hessian:
$$\nabla^2 f(w) = \frac{1}{n} X^T X$$
Since $X^T X$ is always positive semidefinite (it's a Gram matrix), the objective is convex. If $X$ has full column rank, $X^T X$ is positive definite, making $f$ strictly convex with unique solution.
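As a quick numerical sanity check (a minimal NumPy sketch on synthetic data), we can confirm the Hessian $X^T X / n$ is PSD and tie strict convexity to full column rank:

```python
import numpy as np

# Minimal sketch: check convexity of OLS via the Hessian X^T X / n.
np.random.seed(0)
n, d = 50, 5
X = np.random.randn(n, d)

H = X.T @ X / n                     # Hessian of (1/2n)||Xw - y||^2
eigs = np.linalg.eigvalsh(H)

print(f"min eigenvalue: {eigs.min():.6f}")                    # >= 0: convex
print(f"full column rank: {np.linalg.matrix_rank(X) == d}")   # rank d => strictly convex
```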
Logistic Regression:
Objective: $f(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i x_i^T w})$
The Hessian is:
$$\nabla^2 f(w) = \frac{1}{n} \sum_{i=1}^{n} \sigma_i (1 - \sigma_i) x_i x_i^T = \frac{1}{n} X^T D X$$
where $\sigma_i = \sigma(y_i x_i^T w) \in (0, 1)$ and $D$ is diagonal with positive entries $\sigma_i(1 - \sigma_i)$.
Since $D \succ 0$, we have $X^T D X \succeq 0$, proving convexity. With L2 regularization, the Hessian becomes $X^T D X + \lambda I \succ 0$, guaranteeing strong convexity.
```python
import numpy as np
from scipy.linalg import eigvalsh

def verify_logistic_convexity(X, y, w, lambda_reg=0.0):
    """
    Verify logistic regression Hessian is PSD (convex) or PD (strongly convex).
    """
    n, d = X.shape

    # Compute sigmoid activations
    z = y * (X @ w)
    sigma = 1 / (1 + np.exp(-z))

    # Diagonal weights: sigma * (1 - sigma)
    D = sigma * (1 - sigma)

    # Hessian: X^T D X / n + lambda * I
    H = (X.T @ np.diag(D) @ X) / n + lambda_reg * np.eye(d)

    eigenvalues = eigvalsh(H)
    min_eig = np.min(eigenvalues)

    return {
        'hessian': H,
        'eigenvalues': eigenvalues,
        'min_eigenvalue': min_eig,
        'is_convex': min_eig >= -1e-10,
        'is_strongly_convex': min_eig > 1e-10,
        'strong_convexity_param': max(0, min_eig)
    }

def verify_linear_svm_convexity(X, lambda_reg=0.01):
    """
    Verify L2-regularized linear SVM is strongly convex.

    Objective: (1/n) sum_i max(0, 1 - y_i w^T x_i) + (lambda/2) ||w||^2

    The hinge loss max(0, 1 - y_i w^T x_i) is convex but not differentiable.
    The L2 term adds lambda * I to the effective Hessian.
    """
    n, d = X.shape

    # The subgradient-based analysis shows SVM is lambda-strongly convex
    # due to the L2 regularizer
    return {
        'strong_convexity_param': lambda_reg,
        'is_strongly_convex': lambda_reg > 0,
        'note': 'Hinge loss is convex; L2 regularizer adds strong convexity'
    }

# Demo
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
y = np.sign(np.random.randn(n))
w = np.zeros(d)

print("Logistic Regression (unregularized):")
result = verify_logistic_convexity(X, y, w, lambda_reg=0.0)
print(f"  Convex: {result['is_convex']}, Strongly convex: {result['is_strongly_convex']}")
print(f"  Min eigenvalue: {result['min_eigenvalue']:.6f}")

print("\nLogistic Regression (L2 regularized, λ=0.1):")
result = verify_logistic_convexity(X, y, w, lambda_reg=0.1)
print(f"  Convex: {result['is_convex']}, Strongly convex: {result['is_strongly_convex']}")
print(f"  Strong convexity parameter: {result['strong_convexity_param']:.6f}")
```

Support Vector Machines:
The SVM primal objective is:
$$\min_{w, b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(w^T x_i + b))$$
This is a sum of:
- the quadratic term $\frac{1}{2}\|w\|^2$, which is convex (indeed strongly convex), and
- the hinge losses $\max(0, 1 - y_i(w^T x_i + b))$, each a maximum of affine functions and hence convex.

A sum of convex functions is convex, so the SVM objective is convex. The $\frac{1}{2}\|w\|^2$ term makes it strongly convex in $w$.
The dual formulation is a quadratic program:
$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$. This is concave maximization (convex minimization of the negative) over a convex polytope.
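As a sketch of how small instances can be solved, here is the dual handed to a general-purpose constrained solver (SciPy's SLSQP) on synthetic separable data. Real SVM libraries use specialized solvers such as SMO; this is illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch: solve the SVM dual as a generic constrained problem.
np.random.seed(0)
n, C = 40, 1.0
X = np.vstack([np.random.randn(n // 2, 2) + 2, np.random.randn(n // 2, 2) - 2])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

K = X @ X.T                          # linear kernel
Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j K(x_i, x_j)

def neg_dual(alpha):                 # minimize the negative (concave) dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual, x0=np.zeros(n), method='SLSQP',
    bounds=[(0, C)] * n,                                  # box constraints 0 <= alpha_i <= C
    constraints=[{'type': 'eq', 'fun': lambda a: a @ y}]  # sum_i alpha_i y_i = 0
)
print(f"dual optimum: {-res.fun:.4f}, support vectors: {(res.x > 1e-6).sum()}")
```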
Neural networks fundamentally break convexity. Understanding why is essential for appreciating both their power and their limitations.
The Core Issue: Composition of Nonlinearities
Consider a single hidden layer network:
$$f(x; W_1, W_2) = W_2 \cdot \text{ReLU}(W_1 x)$$
Even with a convex loss $\ell$, the composite $\ell(f(x; W_1, W_2), y)$ is non-convex in $(W_1, W_2)$.
Proof by Example:
For a 1-hidden-unit network $f(x; w_1, w_2) = w_2 \cdot \text{ReLU}(w_1 \cdot x)$:
With $x = 1$ and target $y = 1$, squared loss:
$$L(w_1, w_2) = (w_2 \cdot \max(0, w_1) - 1)^2$$
For $w_1 > 0$: $L = (w_1 w_2 - 1)^2$
The set $\{(w_1, w_2) : w_1 w_2 = 1\}$ contains infinitely many global minima (a hyperbola). But the set of minimizers of a convex function is convex: the segment joining any two minima must itself consist of minima. A hyperbola is not a convex set.
Check: $(1, 1)$ and $(2, 0.5)$ both give $L = 0$. Midpoint $(1.5, 0.75)$: $L = (1.125 - 1)^2 = 0.0156 > 0$. The loss increases between minima—violating convexity.
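The same check in code:

```python
import numpy as np

# Numeric check of the convexity violation for L(w1, w2) = (w2 * max(0, w1) - 1)^2.
def L(w1, w2):
    return (w2 * max(0.0, w1) - 1.0) ** 2

a, b = (1.0, 1.0), (2.0, 0.5)        # two global minima on the hyperbola w1 * w2 = 1
mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

print(f"L(a) = {L(*a)}, L(b) = {L(*b)}")   # both 0
print(f"L(midpoint) = {L(*mid):.6f}")      # 0.015625 > 0 => not convex
```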
Neural networks have inherent symmetries: permuting hidden units gives identical functions but different parameter vectors, so every minimum comes with many symmetric copies. The set of minimizers of a convex function must be convex, and these scattered, isolated copies are not. Non-convexity is structural, not incidental.
Sources of Non-Convexity in Neural Networks:
- Composition: even a convex loss composed with nonlinear activations loses convexity.
- Bilinear interactions: products of weight matrices across layers (e.g., $W_2 \cdot \text{ReLU}(W_1 x)$) are non-convex in the joint parameters, just as $UV^T$ is in matrix factorization.
- Symmetries: hidden-unit permutations and ReLU rescalings create many equivalent parameter settings that a convex landscape cannot accommodate.
Yet Neural Networks Often Train Successfully:
Despite non-convexity, deep learning works remarkably well in practice. Several factors contribute:
Overparameterization: With more parameters than data points, the optimization landscape becomes more benign—many paths lead to global minima.
No bad local minima (sometimes): For certain architectures and data distributions, all local minima have loss close to the global minimum.
Saddle points escape: SGD noise helps escape saddle points, which are more common than local minima in high dimensions.
Implicit regularization: SGD dynamics favor solutions with good generalization properties, not just any minimum.
Lottery ticket hypothesis: Sparse subnetworks within randomly initialized networks can achieve good performance—we just need to find them.
This is an active research area. We can't prove deep learning always works, but we're developing theoretical frameworks to explain when and why it does.
When the natural objective is non-convex, we can sometimes replace it with a convex surrogate—a convex function that approximates or bounds the original, enabling tractable optimization.
Example: 0-1 Loss to Convex Surrogates
The true classification objective is 0-1 loss: $\ell_{0-1}(y, \hat{y}) = \mathbf{1}[y \neq \hat{y}]$
This is non-convex (and non-continuous), making optimization intractable.
Standard practice: replace it with a convex surrogate of the margin $z = y \cdot \hat{y}$, such as the hinge, logistic, exponential, or squared hinge loss (compared in the table below).
The hinge, exponential, and squared hinge losses are convex upper bounds on the 0-1 loss; the logistic loss is too once rescaled by $1/\ln 2$ (i.e., taken in base 2). And all point optimization in the right direction: a positive margin implies correct classification.
| Loss | Formula | Properties |
|---|---|---|
| 0-1 (original) | $\mathbf{1}[z < 0]$ | Non-convex, non-differentiable, combinatorial |
| Hinge | $\max(0, 1-z)$ | Convex, non-smooth at $z=1$, max-margin |
| Logistic | $\log(1 + e^{-z})$ | Convex, smooth, probabilistic interpretation |
| Exponential | $e^{-z}$ | Convex, smooth, sensitive to outliers |
| Squared hinge | $\max(0, 1-z)^2$ | Convex, smooth, differentiable everywhere |
A surrogate loss is calibrated if minimizing it over all functions also minimizes the 0-1 loss. Logistic, hinge, and exponential losses are all calibrated for binary classification—finding their optima gives Bayes-optimal classifiers.
Convex Relaxations:
Sometimes we can "relax" a non-convex constraint into a convex one:
Rank minimization → Nuclear norm:
The problem $\min \text{rank}(X)$ subject to constraints is non-convex (rank is discrete).
Convex relaxation: $\min \|X\|_* = \sum_i \sigma_i(X)$ (nuclear norm = sum of singular values).
Under certain conditions (restricted isometry property), the nuclear norm minimizer also minimizes rank—the relaxation is tight.
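A minimal sketch of the workhorse operation behind nuclear-norm methods: singular value soft-thresholding, the proximal operator of the nuclear norm used in convex low-rank recovery algorithms:

```python
import numpy as np

# Minimal sketch: prox of the nuclear norm = singular value soft-thresholding.
def svt(X, tau):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

np.random.seed(0)
A = np.random.randn(20, 3) @ np.random.randn(3, 20)   # rank-3 ground truth
noisy = A + 0.1 * np.random.randn(20, 20)

denoised = svt(noisy, tau=1.0)
print(f"rank before: {np.linalg.matrix_rank(noisy)}, "
      f"after SVT: {np.linalg.matrix_rank(denoised)}")
```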
Sparse recovery → L1 minimization:
The problem $\min \|w\|_0$ (number of nonzeros) is non-convex.
Convex relaxation: $\min \|w\|_1$ (sum of absolute values).
Under RIP conditions, L1 minimization recovers the sparsest solution—basis pursuit success.
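A minimal basis-pursuit sketch on synthetic data: the non-smooth problem $\min \|w\|_1$ subject to $Xw = y$ becomes a linear program after splitting $w = w^+ - w^-$:

```python
import numpy as np
from scipy.optimize import linprog

# Minimal basis-pursuit sketch: min ||w||_1  s.t.  X w = y,
# as an LP over w = w_plus - w_minus with w_plus, w_minus >= 0.
np.random.seed(0)
n, d, k = 30, 60, 4
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[np.random.choice(d, k, replace=False)] = np.random.randn(k)
y = X @ w_true

res = linprog(
    c=np.ones(2 * d),                    # minimize sum(w_plus) + sum(w_minus) = ||w||_1
    A_eq=np.hstack([X, -X]), b_eq=y,     # X (w_plus - w_minus) = y
    bounds=[(0, None)] * (2 * d),
    method='highs',
)
w_hat = res.x[:d] - res.x[d:]
print(f"recovery error: {np.linalg.norm(w_hat - w_true):.2e}")
```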
Semidefinite relaxations:
Nonconvex quadratic constraints such as $x^T Q x \leq 1$ with indefinite $Q$ can be relaxed using semidefinite programming: lift to a matrix variable $X = xx^T$, drop the rank-one constraint, and keep $X \succeq 0$. In the lifted space the problem becomes convex.
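A sketch of this lifting, assuming the cvxpy package (with an SDP-capable solver such as SCS) is available — a max-cut-style relaxation of an indefinite quadratic:

```python
import cvxpy as cp
import numpy as np

# Sketch (assumes cvxpy + an SDP solver): relax max x^T Q x over x in {-1, +1}^n
# by lifting to X = x x^T, dropping rank(X) = 1, keeping X >> 0 and diag(X) = 1.
np.random.seed(0)
n = 5
Q = np.random.randn(n, n)
Q = (Q + Q.T) / 2                     # indefinite symmetric matrix

X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Maximize(cp.trace(Q @ X)),
                  [X >> 0, cp.diag(X) == 1])
prob.solve()
print(f"SDP relaxation value: {prob.value:.4f}")
```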
```python
import numpy as np

def zero_one_loss(z):
    """Original 0-1 loss: 1 if z < 0, else 0"""
    return (np.asarray(z) < 0).astype(float)

def hinge_loss(z):
    """Hinge loss: max(0, 1 - z)"""
    return np.maximum(0, 1 - z)

def logistic_loss(z):
    """Base-2 logistic loss: log2(1 + exp(-z)).

    The 1/ln(2) rescaling (vs. natural log) is what makes this an
    upper bound on the 0-1 loss; convexity is unaffected.
    """
    return np.log2(1 + np.exp(-z))

def exponential_loss(z):
    """Exponential loss: exp(-z)"""
    return np.exp(-z)

def squared_hinge_loss(z):
    """Squared hinge: max(0, 1-z)^2"""
    return np.maximum(0, 1 - z) ** 2

# Verify convexity of surrogates
def verify_loss_convexity(loss_fn, z_range=(-3, 3), n_points=1000):
    """
    Verify convexity by checking if loss lies below chords.
    """
    z = np.linspace(z_range[0], z_range[1], n_points)
    loss = loss_fn(z)

    violations = 0
    tests = 0
    for _ in range(1000):
        i, j = np.random.randint(0, n_points, 2)
        if i == j:
            continue
        theta = np.random.uniform(0, 1)
        z_interp = theta * z[i] + (1 - theta) * z[j]
        loss_interp = theta * loss[i] + (1 - theta) * loss[j]
        loss_at_interp = loss_fn(z_interp)
        tests += 1
        if loss_at_interp > loss_interp + 1e-10:
            violations += 1

    return {'tests': tests, 'violations': violations, 'is_convex': violations == 0}

print("Convexity verification:")
for name, fn in [('Hinge', hinge_loss), ('Logistic', logistic_loss),
                 ('Exponential', exponential_loss), ('Squared Hinge', squared_hinge_loss)]:
    result = verify_loss_convexity(fn)
    print(f"  {name}: {result['is_convex']} ({result['violations']} violations)")

# All surrogates upper bound the 0-1 loss (logistic thanks to the base-2 scaling;
# in natural log it dips below 1 for small negative margins)
print("\nUpper bound verification at z=-0.5 (a misclassified point):")
z_val = -0.5
bound_holds = all([
    hinge_loss(z_val) >= zero_one_loss(z_val),
    logistic_loss(z_val) >= zero_one_loss(z_val),
    exponential_loss(z_val) >= zero_one_loss(z_val),
])
print(f"  0-1 loss: {zero_one_loss(z_val)}")
print(f"  Hinge: {hinge_loss(z_val):.3f}")
print(f"  Logistic (base 2): {logistic_loss(z_val):.3f}")
print(f"  All surrogates >= 0-1 loss: {bound_holds}")
```

Given the choice between convex and non-convex formulations, when should you insist on convexity? This depends on your requirements for reliability, interpretability, and performance.
In practice, many ML practitioners use non-convex methods (deep learning) for accuracy and convex methods (logistic regression) for interpretability and reliability. The choice depends on the specific application, not a universal rule.
Hybrid Approaches:
Sometimes you can get the best of both worlds:
Convex initialization for non-convex problems: Train a logistic regression, use its weights to initialize a neural network. Starting near a good solution helps non-convex optimization.
Convex outer loop, non-convex inner loop: In meta-learning, the outer optimization might be convex while the inner adaptation is non-convex.
Neural network features + convex classifier: Use a pretrained network to extract features, then train a convex classifier (SVM, logistic regression) on top — see the sketch after this list.
Convex constraints in non-convex optimization: Even in neural network training, we can enforce convex constraints on the feasible region (e.g., norm bounds on weights).
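A minimal sketch of the features-plus-convex-classifier pattern using scikit-learn, with a hypothetical `extract_features` standing in for a frozen pretrained network (here just a fixed random nonlinear map):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(X):
    """Hypothetical stand-in for a frozen pretrained network's feature map."""
    W = np.random.RandomState(0).randn(X.shape[1], 32)
    return np.tanh(X @ W)

np.random.seed(1)
X = np.random.randn(200, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

Phi = extract_features(X)                                   # fixed nonlinear features
clf = LogisticRegression(max_iter=1000).fit(Phi, y)         # convex training problem in w
print(f"train accuracy: {clf.score(Phi, y):.3f}")
```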
When designing ML systems, you can often engineer convexity into your objective. Here are key principles and techniques.
Principle 1: Convex Loss + Linear Model = Convex Problem
If your model is linear in parameters (even with fixed nonlinear features) and your loss is convex, the composite is convex.
$$\min_w \sum_{i=1}^{n} \ell(\phi(x_i)^T w, y_i) + R(w)$$
With convex $\ell$ and convex regularizer $R$, this is convex in $w$.
Principle 2: Use Kernel Methods for Nonlinearity
Kernel methods give nonlinear decision boundaries while maintaining convex objectives:
$$\min_\alpha \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) + \text{linear terms}$$
The kernel implicitly maps to high-dimensional feature space, but optimization remains in the convex dual.
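For example, kernel ridge regression keeps the nonlinearity while staying convex — its dual solution is even closed-form, $\alpha = (K + \lambda I)^{-1} y$. A minimal sketch:

```python
import numpy as np

# Minimal kernel ridge regression sketch: nonlinear fit, convex (closed-form) training.
def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

np.random.seed(0)
X = np.random.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual solution (K + lam I)^{-1} y

X_test = np.array([[0.5]])
pred = rbf_kernel(X_test, X) @ alpha
print(f"prediction at x=0.5: {pred[0]:.3f} (true sin(0.5) = {np.sin(0.5):.3f})")
```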
Principle 3: Regularization for Strong Convexity
Add L2 regularization to ensure strong convexity:
$$\min_w f(w) + \frac{\lambda}{2}\|w\|^2$$
This guarantees unique solution and linear convergence rate $O((1 - \lambda/L)^k)$.
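A quick numerical illustration of that linear rate — gradient descent on ridge regression (a sketch; the exact constants depend on the spectrum of $X^T X$):

```python
import numpy as np

# Sketch: linear convergence of gradient descent on strongly convex ridge regression.
np.random.seed(0)
n, d, lam = 100, 10, 0.5
X = np.random.randn(n, d)
y = np.random.randn(n)

def grad(w):
    return X.T @ (X @ w - y) / n + lam * w

# Smoothness L and strong convexity mu from the Hessian spectrum
eigs = np.linalg.eigvalsh(X.T @ X / n) + lam
L_smooth, mu = eigs.max(), eigs.min()

w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
w = np.zeros(d)
for k in range(1, 101):
    w -= (1 / L_smooth) * grad(w)          # step size 1/L
    if k % 25 == 0:
        print(f"iter {k:3d}: ||w - w*|| = {np.linalg.norm(w - w_star):.2e}")
print(f"predicted contraction per step: {1 - mu / L_smooth:.3f}")
```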
```python
import numpy as np
from scipy.linalg import eigvalsh

class ConvexMLObjective:
    """
    Framework for building and verifying convex ML objectives.

    Objective: (1/n) sum_i loss(phi(x_i)^T w, y_i) + regularization(w)
    """

    def __init__(self, loss_fn, loss_hessian_fn, regularizer_fn, reg_hessian_fn):
        self.loss_fn = loss_fn
        self.loss_hessian_fn = loss_hessian_fn
        self.regularizer_fn = regularizer_fn
        self.reg_hessian_fn = reg_hessian_fn

    def objective(self, w, Phi, y):
        """Compute objective value"""
        predictions = Phi @ w
        loss = np.mean([self.loss_fn(p, yi) for p, yi in zip(predictions, y)])
        reg = self.regularizer_fn(w)
        return loss + reg

    def verify_convexity(self, w, Phi, y):
        """Verify objective is convex at point w"""
        n, d = Phi.shape

        # Compute total Hessian
        H_loss = np.zeros((d, d))
        for i in range(n):
            # Loss Hessian contribution from sample i
            h_i = self.loss_hessian_fn(Phi[i] @ w, y[i])
            H_loss += h_i * np.outer(Phi[i], Phi[i])
        H_loss /= n

        H_reg = self.reg_hessian_fn(w)
        H_total = H_loss + H_reg

        eigenvalues = eigvalsh(H_total)
        return {
            'min_eigenvalue': np.min(eigenvalues),
            'is_convex': np.min(eigenvalues) >= -1e-10,
            'is_strongly_convex': np.min(eigenvalues) > 1e-10
        }

# Example: L2-regularized logistic regression
def logistic_loss(pred, y):
    z = y * pred
    return np.log(1 + np.exp(-z))

def logistic_loss_hessian(pred, y):
    z = y * pred
    sigma = 1 / (1 + np.exp(-z))
    return sigma * (1 - sigma)

def l2_regularizer(w, lambda_reg=0.1):
    return (lambda_reg / 2) * np.sum(w ** 2)

def l2_hessian(w, lambda_reg=0.1):
    return lambda_reg * np.eye(len(w))

# Create and verify
np.random.seed(42)
n, d = 100, 10
Phi = np.random.randn(n, d)  # Feature matrix (could be kernel-based)
y = np.sign(np.random.randn(n))
w = np.random.randn(d)

objective = ConvexMLObjective(
    logistic_loss, logistic_loss_hessian,
    lambda w: l2_regularizer(w, 0.1),
    lambda w: l2_hessian(w, 0.1)
)

result = objective.verify_convexity(w, Phi, y)
print(f"Objective value: {objective.objective(w, Phi, y):.4f}")
print(f"Is convex: {result['is_convex']}")
print(f"Is strongly convex: {result['is_strongly_convex']}")
print(f"Min eigenvalue: {result['min_eigenvalue']:.6f}")
```

Let's distill practical guidance for ML practitioners navigating the convex vs. non-convex landscape.
For Convex Problems:
- Trust the optimizer: any local minimum it finds is the global minimum.
- Add L2 regularization when you need a unique solution and a linear convergence rate.
- Prefer closed-form solutions (OLS, ridge, kernel ridge) when problem sizes allow.
- Use stopping criteria with confidence: a small gradient norm certifies near-optimality.
For Non-Convex Problems (Neural Networks):
- Initialization matters: use well-tested schemes, or warm-start from a simpler convex model.
- Run multiple random restarts and compare; different runs may reach different minima.
- Lean on SGD noise to escape saddle points and on overparameterization to smooth the landscape.
- Validate empirically: without guarantees, held-out performance is the final arbiter.
For 80% of practical ML problems, convex models (logistic regression, linear SVM, ridge regression) provide 80% of the achievable accuracy with 100% reliability. Reserve non-convex methods for the 20% of problems where that last 20% of accuracy truly matters.
We've mapped the machine learning landscape through the lens of convexity. Let's consolidate the key insights:
- Linear models with convex losses and convex regularizers yield convex training problems; L2 regularization upgrades them to strongly convex.
- Neural networks are structurally non-convex: compositions of nonlinearities, bilinear weight interactions, and parameter symmetries rule convexity out.
- When the natural objective is intractable, convex surrogates (hinge, logistic) and relaxations (L1 for sparsity, nuclear norm for rank) often recover the solution you actually want.
- Convexity is a design choice: fixed features plus a convex loss, kernel methods, and regularization let you engineer it in.
Module Complete:
This concludes our exploration of convex optimization. You now understand:
- convex sets and convex functions, and how to recognize them;
- first- and second-order optimality conditions, and why local minima of convex functions are global;
- which ML objectives are convex, which are not, and why;
- convex surrogates and relaxations for intractable objectives;
- practical strategies for designing convex ML systems, or coping gracefully without convexity.
These foundations will serve you throughout your ML journey, whether you're deriving new algorithms, debugging optimization failures, or choosing between competing approaches.
You've mastered the mathematical foundations of convex optimization and their applications to machine learning. Armed with this knowledge, you can confidently analyze whether ML problems are tractable, design convex objectives when possible, and understand the trade-offs when they're not. Next, we move to gradient descent—the workhorse algorithm that turns convex optimization theory into practical ML training.