The true power of vectors emerges when we combine them. A linear combination takes multiple vectors, scales each by a coefficient, and adds them together to produce a new vector. This seemingly simple operation is the engine driving nearly all of linear algebra and machine learning.
When a neural network processes input, it computes linear combinations of features. When PCA reduces dimensionality, it finds optimal linear combinations. When we solve linear equations or fit regression models, we're searching for the right linear combination. Understanding linear combinations deeply is essential for understanding how data transforms and flows through machine learning systems.
This page explores linear combinations rigorously—their definition, computation, geometric meaning, and role as the foundation for span and linear independence.
By the end of this page, you will be able to:

- state the formal definition of a linear combination and compute one
- build geometric intuition for how vectors combine
- connect linear combinations to machine learning computations (weighted sums, neural layers)
- recognize linear combinations inside matrix-vector multiplication
- explain why linear combinations are 'linear' and what that constraint means
Formal definition:
Given vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ (all in the same vector space $\mathbb{R}^n$) and scalars $c_1, c_2, \ldots, c_k$ (real numbers), a linear combination of these vectors is:
$$\mathbf{w} = c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k = \sum_{i=1}^{k} c_i \mathbf{v}_i$$
The scalars $c_1, c_2, \ldots, c_k$ are called coefficients, weights, or coordinates (depending on context).
Example:
Given $\mathbf{v}_1 = (1, 0, 2)$ and $\mathbf{v}_2 = (0, 1, -1)$ with coefficients $c_1 = 3$ and $c_2 = -2$:
$$\mathbf{w} = 3\mathbf{v}_1 + (-2)\mathbf{v}_2 = 3(1, 0, 2) + (-2)(0, 1, -1)$$ $$= (3, 0, 6) + (0, -2, 2) = (3, -2, 8)$$
The combination is called 'linear' because it involves only two operations: scaling (multiplication by scalars) and addition. No squaring, no products of components, no nonlinear functions. This restriction to linear operations is what makes linear algebra tractable and is why linearity is so important throughout mathematics.
Key observations:

- The coefficients $c_i$ can be any real numbers: positive, negative, or zero.
- The result $\mathbf{w}$ lives in the same space $\mathbb{R}^n$ as the vectors being combined.
- Only two operations are involved: scalar multiplication and vector addition.
```python
import numpy as np

# Simple linear combination example
v1 = np.array([1, 0, 2])
v2 = np.array([0, 1, -1])
c1, c2 = 3, -2

# Compute linear combination
w = c1 * v1 + c2 * v2
print(f"v1 = {v1}")
print(f"v2 = {v2}")
print(f"3*v1 + (-2)*v2 = {w}")  # [3, -2, 8]

# Step by step
print(f"\nStep by step:")
print(f"3 * v1 = {3 * v1}")
print(f"-2 * v2 = {-2 * v2}")
print(f"Sum = {3 * v1 + (-2) * v2}")

# More vectors
v3 = np.array([2, 2, 0])
c3 = 0.5
w2 = c1 * v1 + c2 * v2 + c3 * v3
print(f"\n3*v1 - 2*v2 + 0.5*v3 = {w2}")

# General function for linear combinations
def linear_combination(vectors, coefficients):
    """Compute linear combination of vectors with given coefficients."""
    assert len(vectors) == len(coefficients), "Must have same number of vectors and coefficients"
    result = np.zeros_like(vectors[0], dtype=float)
    for v, c in zip(vectors, coefficients):
        result += c * v
    return result

# Test the function
vectors = [v1, v2, v3]
coeffs = [1, 1, 1]
print(f"\nv1 + v2 + v3 = {linear_combination(vectors, coeffs)}")

# Using matrix form (more efficient)
V = np.column_stack(vectors)  # Vectors as columns
c = np.array(coeffs)
print(f"Matrix form result: {V @ c}")
```

Linear combinations have a beautiful geometric interpretation that provides intuition even in high dimensions.
Two vectors in 2D:
Consider two non-parallel vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ in $\mathbb{R}^2$. Their linear combination:
$$\mathbf{w} = c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2$$
can reach any point in the plane by choosing appropriate $c_1$ and $c_2$.
Visualization process:
By varying $c_1$ and $c_2$ over all real numbers, we sweep out the entire 2D plane.
For any target point, we can find coefficients by completing a parallelogram with sides parallel to v₁ and v₂. The coefficients tell us how many 'units' of each vector direction we need. This is why non-parallel vectors are essential—parallel vectors can only reach a line, not the full plane.
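To make the parallelogram picture concrete, here is one small worked example (the same vectors and target reappear in the code further below). Finding the coefficients amounts to solving a small linear system:

$$c_1 \begin{bmatrix} 2 \\ 1 \end{bmatrix} + c_2 \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 5 \\ 11 \end{bmatrix}
\quad\Longleftrightarrow\quad
\begin{cases} 2c_1 + c_2 = 5 \\ c_1 + 3c_2 = 11 \end{cases}
\quad\Longrightarrow\quad c_1 = 0.8,\; c_2 = 3.4$$

Check: $0.8\,(2, 1) + 3.4\,(1, 3) = (1.6 + 3.4,\; 0.8 + 10.2) = (5, 11)$.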
What linear combinations can reach:
The set of all vectors reachable by linear combinations of given vectors defines what we call the span (covered in detail next page). For now, key observations:
With 1 non-zero vector in $\mathbb{R}^n$: the combinations $c_1 \mathbf{v}_1$ trace out a line through the origin.
With 2 non-parallel vectors in $\mathbb{R}^n$: the combinations fill a plane through the origin.
With 3 non-coplanar vectors in $\mathbb{R}^n$: the combinations fill a three-dimensional subspace through the origin.
Constraint: Through the origin
Linear combinations always pass through the origin (choosing all coefficients = 0). This is a defining property—we can't reach points "offset" from the origin using only linear combinations of vectors based at the origin.
| Vectors | Condition | Reach (in ℝⁿ) |
|---|---|---|
| 1 vector | Non-zero | Line through origin |
| 2 vectors | Not parallel (linearly independent) | Plane through origin |
| 3 vectors | Not coplanar (linearly independent) | 3D subspace through origin |
| k vectors | Linearly independent | k-dimensional subspace through origin |
| n vectors in ℝⁿ | Linearly independent (basis) | Entire ℝⁿ |
```python
import numpy as np

# Demonstration: expressing any 2D point as linear combination
v1 = np.array([1, 0])  # x-axis direction
v2 = np.array([0, 1])  # y-axis direction

# Any point (a, b) is exactly a*v1 + b*v2
target = np.array([3, 2])
c1, c2 = 3, 2  # Coefficients equal to coordinates!
result = c1 * v1 + c2 * v2
print(f"Target: {target}")
print(f"Reconstructed: {result}")
print(f"Match: {np.allclose(target, result)}")

# With non-standard basis vectors
v1 = np.array([2, 1])
v2 = np.array([1, 3])

# Find coefficients to reach (5, 11)
target = np.array([5, 11])

# We need to solve: c1*v1 + c2*v2 = target
# This is a system of linear equations
V = np.column_stack([v1, v2])
coeffs = np.linalg.solve(V, target)
print(f"\nTo reach {target} using v1={v1}, v2={v2}:")
print(f"Coefficients: c1={coeffs[0]:.4f}, c2={coeffs[1]:.4f}")
print(f"Verification: c1*v1 + c2*v2 = {coeffs[0]*v1 + coeffs[1]*v2}")

# What happens with parallel vectors?
v1 = np.array([1, 2])
v2 = np.array([2, 4])  # v2 = 2 * v1 (parallel!)

# Can only reach points on the line through v1
target_on_line = np.array([3, 6])   # = 3 * v1, reachable
target_off_line = np.array([1, 1])  # Not on line v1, unreachable!

print(f"\nWith parallel vectors v1={v1}, v2={v2}:")
print(f"{target_on_line} = 3*v1 (reachable)")
print(f"{target_off_line} is OFF the line v1 (unreachable by linear combo)")
```

Linear combinations are everywhere in machine learning. Recognizing them helps you understand what models are actually computing.
Neural Network Layers:
Each neuron computes a linear combination of its inputs (followed by a nonlinear activation):
$$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^\top \mathbf{x} + b$$
The weights $\mathbf{w}$ are coefficients, and the inputs $\mathbf{x}$ are the vectors being combined (or vice versa, depending on perspective).
Fully Connected Layer:
A layer with multiple neurons computes multiple linear combinations simultaneously:
$$\mathbf{z} = W \mathbf{x} + \mathbf{b}$$
Each row of $W$ defines one linear combination—the weights for one output neuron.
A linear combination of linear combinations is still just a linear combination! Without activation functions, stacking layers would collapse to a single linear transformation. Nonlinear activations break this collapse, enabling neural networks to approximate complex functions.
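To see this collapse concretely, here is a minimal sketch with two hypothetical bias-free weight matrices `W1` and `W2`: composing the two layers gives exactly the same output as the single matrix `W2 @ W1`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)          # input with 3 features
W1 = rng.standard_normal((4, 3))    # layer 1: 3 -> 4 (no bias, no activation)
W2 = rng.standard_normal((2, 4))    # layer 2: 4 -> 2

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)

# ...collapse to a single linear layer with weights W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True: stacking adds no expressive power
```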
More ML contexts:
| ML Context | What's Combined | Coefficients | Result |
|---|---|---|---|
| Linear Regression | Feature values | Model weights | Prediction |
| PCA | Original features | Principal component loadings | Reduced features |
| Word Embeddings | One-hot vectors | Embedding matrix rows | Dense word vector |
| Attention | Value vectors | Attention weights | Context vector |
| Ensemble Methods | Base model predictions | Ensemble weights | Final prediction |
| Kernel Methods | Training examples | Dual coefficients (α) | Decision boundary |
Linear Regression as Linear Combination:
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
This is a linear combination of features $[1, x_1, x_2, \ldots, x_n]$ (including the bias term as $x_0 = 1$) with coefficients $[w_0, w_1, \ldots, w_n]$.
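As a small illustrative sketch (reusing the hypothetical house-price weights that appear in the code further below), prepending a constant feature $x_0 = 1$ lets the bias be absorbed as an ordinary coefficient:

```python
import numpy as np

x = np.array([1500.0, 3.0, 10.0])   # features: [sqft, bedrooms, age]
w = np.array([0.1, 10.0, -0.5])     # weights
w0 = 50.0                           # bias

# Standard form: w0 + w . x
y_hat = w0 + w @ x

# Augmented form: prepend x0 = 1 and fold the bias into the weight vector
x_aug = np.concatenate(([1.0], x))  # [1, x1, ..., xn]
w_aug = np.concatenate(([w0], w))   # [w0, w1, ..., wn]
y_hat_aug = w_aug @ x_aug           # a single linear combination

print(np.allclose(y_hat, y_hat_aug))  # True
```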
Attention Mechanism:
Transformer attention computes a weighted average of value vectors:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
The softmax output provides coefficients (attention weights) for linearly combining value vectors. Each output is a linear combination of all value vectors, weighted by relevance.
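A minimal NumPy sketch of this formula (with tiny, hypothetical $Q$, $K$, $V$ matrices), showing that each output row is a weighted sum of the rows of $V$ with non-negative weights that sum to 1:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax (numerically stabilized)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 4
Q = rng.standard_normal((2, d_k))   # 2 queries
K = rng.standard_normal((3, d_k))   # 3 keys
V = rng.standard_normal((3, 5))     # 3 value vectors of dimension 5

weights = softmax(Q @ K.T / np.sqrt(d_k))   # shape (2, 3): one weight row per query
output = weights @ V                        # each output row = weighted sum of V's rows

print(weights.sum(axis=1))   # each row sums to 1, all entries non-negative
print(output.shape)          # (2, 5)
```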
```python
import numpy as np

# Linear regression prediction as linear combination
def linear_regression_predict(X, weights, bias):
    """Prediction is linear combination of features."""
    return X @ weights + bias

# Example: house price prediction
# Features: [sqft, bedrooms, age]
X = np.array([
    [1500, 3, 10],
    [2000, 4, 5],
    [1200, 2, 20],
])
weights = np.array([0.1, 10.0, -0.5])  # Learned weights
bias = 50                              # Learned bias

predictions = linear_regression_predict(X, weights, bias)
print("Linear Regression as Linear Combination:")
print(f"Features shape: {X.shape}")
print(f"Weights: {weights}")
print(f"Predictions: {predictions}")

# Single prediction decomposed
x = X[0]
print(f"\nFor house with features {x}:")
print(f"  0.1 * {x[0]} (sqft term) = {0.1 * x[0]}")
print(f"  + 10 * {x[1]} (bedroom term) = {10 * x[1]}")
print(f"  + -0.5 * {x[2]} (age term) = {-0.5 * x[2]}")
print(f"  + {bias} (bias) = {bias}")
print(f"  = {predictions[0]}")

# Neural network layer as linear combinations
def dense_layer(X, W, b):
    """Each output neuron is a linear combination of inputs."""
    return X @ W + b  # Plus nonlinearity in practice

# Input: 3 features, Output: 4 neurons
X = np.array([[1, 2, 3]])  # 1 sample, 3 features
W = np.random.randn(3, 4)  # 3 input, 4 output
b = np.zeros(4)

output = dense_layer(X, W, b)
print(f"\nNeural Network Layer:")
print(f"Input shape: {X.shape}")
print(f"Weight shape: {W.shape}")
print(f"Output shape: {output.shape}")
print(f"Each output = linear combination of 3 inputs")

# Simplified attention: weighted combination of values
values = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
])  # 3 value vectors
attention_weights = np.array([0.5, 0.3, 0.2])  # Softmax output

context = attention_weights @ values
print(f"\nAttention as Linear Combination:")
print(f"Values:\n{values}")
print(f"Attention weights: {attention_weights}")
print(f"Context vector (weighted sum): {context}")
```

There's a profound connection between linear combinations and matrix-vector multiplication. Understanding this connection illuminates both concepts.
Column view of matrix-vector multiplication:
When we multiply a matrix $A$ by a vector $\mathbf{x}$:
$$A\mathbf{x} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n$$
The result is a linear combination of the columns of $A$, with coefficients from $\mathbf{x}$.
This is called the column picture of matrix-vector multiplication.
The insight that Ax is a linear combination of A's columns is one of the most important ideas in linear algebra. It means the columns of A determine what outputs are possible—the range of the transformation. This perspective is essential for understanding why matrices have ranks and why some systems have no solution.
Example:
$$\begin{bmatrix} 1 & 3 \\ 2 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = 2 \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix} + 4 \begin{bmatrix} 3 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \\ 0 \end{bmatrix} + \begin{bmatrix} 12 \\ 4 \\ 8 \end{bmatrix} = \begin{bmatrix} 14 \\ 8 \\ 8 \end{bmatrix}$$
Row view (for comparison):
The more familiar "row view" computes each output as a dot product:
$$(A\mathbf{x})_i = (\text{row } i \text{ of } A) \cdot \mathbf{x}$$
Both views give the same answer, but the column view reveals the linear combination structure.
Why this matters: the system $A\mathbf{x} = \mathbf{b}$ has a solution exactly when $\mathbf{b}$ is a linear combination of the columns of $A$, i.e., when $\mathbf{b}$ lies in the column space. The code below checks reachability for a $3 \times 2$ matrix whose columns span only a plane in $\mathbb{R}^3$.
```python
import numpy as np

# Matrix-vector multiplication as linear combination
A = np.array([
    [1, 3],
    [2, 1],
    [0, 2]
])
x = np.array([2, 4])

# Standard computation
result = A @ x
print(f"A @ x = {result}")

# Column view: linear combination of columns
col1 = A[:, 0]  # First column
col2 = A[:, 1]  # Second column
column_view = x[0] * col1 + x[1] * col2
print(f"\nColumn view:")
print(f"{x[0]} * {col1} + {x[1]} * {col2}")
print(f"= {x[0] * col1} + {x[1] * col2}")
print(f"= {column_view}")

# Row view: dot products
row_view = np.array([
    np.dot(A[0], x),
    np.dot(A[1], x),
    np.dot(A[2], x)
])
print(f"\nRow view:")
for i in range(3):
    print(f"Row {i} · x = {A[i]} · {x} = {row_view[i]}")

# All three methods give same answer
print(f"\nAll equal: {np.allclose(result, column_view) and np.allclose(result, row_view)}")

# The question: can we reach b with some linear combination?
b_reachable = result                 # Same as A @ [2, 4]
b_unreachable = np.array([0, 0, 1])  # Is this in span of columns?

# For a 3x2 matrix, columns span at most a 2D plane in R^3
# Not all R^3 vectors are reachable!
print(f"\nColumn space of A spans a 2D plane in R^3")
print(f"{b_reachable} is reachable (linear combination with x={x})")

# Check if b is in column space (approximate)
# Full solution requires least squares or checking residual
x_approx, residuals, rank, s = np.linalg.lstsq(A, b_unreachable, rcond=None)
if len(residuals) > 0 and residuals[0] > 1e-10:
    print(f"{b_unreachable} is NOT exactly reachable (residual = {residuals[0]:.4f})")
```

Certain restricted types of linear combinations have special names and properties that appear frequently in ML.
Affine combination:
An affine combination requires coefficients to sum to 1:
$$\mathbf{w} = c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k \quad \text{where } \sum_{i=1}^{k} c_i = 1$$
Geometrically, affine combinations of two points give all points on the line through them (not just through the origin). Affine combinations of three non-collinear points give all points on the plane through them.
Example: for two points $\mathbf{p}$ and $\mathbf{q}$, the affine combination $c\,\mathbf{p} + (1 - c)\,\mathbf{q} = \mathbf{q} + c(\mathbf{p} - \mathbf{q})$ traces out the whole line through them as $c$ ranges over the reals: $c = 0.5$ gives the midpoint, while $c = 1.5$ lands beyond $\mathbf{p}$.
Linear combinations must pass through the origin. Affine combinations can pass through any point. When we add a bias term to a linear model (y = Wx + b), we're moving from linear to affine. The term 'linear regression' is technically a misnomer—it's affine regression!
Convex combination:
A convex combination requires coefficients to sum to 1 and all be non-negative:
$$\mathbf{w} = c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k \quad \text{where } \sum_{i=1}^{k} c_i = 1 \text{ and } c_i \geq 0 \text{ for all } i$$
Geometrically, convex combinations give points between the original vectors—inside the convex hull.
Example: restricting the same combination $c\,\mathbf{p} + (1 - c)\,\mathbf{q}$ to $c \in [0, 1]$ yields only the segment between $\mathbf{p}$ and $\mathbf{q}$; with three points, convex combinations fill the triangle (convex hull) they define.
| Type | Coefficient Constraint | Geometric Meaning | ML Example |
|---|---|---|---|
| Linear combination | None | All reachable points through origin | Neural network layer (without bias) |
| Affine combination | $\sum c_i = 1$ | All reachable points (any location) | Linear regression with bias |
| Convex combination | $\sum c_i = 1$, $c_i \geq 0$ | Interior of convex hull | Mixture models, attention weights |
| Conic combination | $c_i \geq 0$ | Cone with apex at origin | Non-negative matrix factorization |
In machine learning: attention weights and mixture-model weights are convex combinations, adding a bias term turns a purely linear model into an affine one, and non-negativity constraints (as in non-negative matrix factorization) lead to conic combinations.
```python
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])
v3 = np.array([1, 1])

# Linear combination (no constraints)
linear = 2 * v1 + (-3) * v2  # Any coefficients
print(f"Linear combination: 2*v1 - 3*v2 = {linear}")

# Affine combination (sum to 1)
affine = 0.3 * v1 + 0.7 * v2  # 0.3 + 0.7 = 1
print(f"Affine combination: 0.3*v1 + 0.7*v2 = {affine}")

# But affine allows negative (still sums to 1)
affine_neg = 1.5 * v1 + (-0.5) * v2  # 1.5 - 0.5 = 1
print(f"Affine with negative: 1.5*v1 - 0.5*v2 = {affine_neg}")

# Convex combination (sum to 1, all non-negative)
convex = 0.4 * v1 + 0.35 * v2 + 0.25 * v3  # All >= 0, sum = 1
print(f"Convex combination: 0.4*v1 + 0.35*v2 + 0.25*v3 = {convex}")

# Attention as convex combination
def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

# Value vectors (e.g., from transformer)
values = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

# Raw attention scores
raw_scores = np.array([2.0, 1.0, 0.5])
attention_weights = softmax(raw_scores)

print(f"\nAttention mechanism:")
print(f"Attention weights: {attention_weights}")
print(f"Sum of weights: {attention_weights.sum():.4f}")
print(f"All non-negative: {np.all(attention_weights >= 0)}")
print(f"=> This is a convex combination!")

context = attention_weights @ values
print(f"Context vector: {context}")

# Mixup data augmentation (convex combination of samples)
x1 = np.array([100, 200, 300])  # One training sample
x2 = np.array([10, 20, 30])     # Another sample
lambda_mix = 0.7                # Mixup coefficient

x_mixed = lambda_mix * x1 + (1 - lambda_mix) * x2
print(f"\nMixup augmentation:")
print(f"Mixed sample: {x_mixed}")
```

Linear combinations preserve certain structures—a property called closure that's fundamental to linear algebra.
Closure property:
The set of all linear combinations of a set of vectors is closed under vector addition and scalar multiplication: adding two such combinations, or scaling one by any real number, produces another linear combination of the same vectors.
In other words, the set of all linear combinations of given vectors is itself a vector space (or subspace of the ambient space).
Why this matters:
Closure means working within the span is "safe"—we can add and scale without leaving the set. This is why subspaces are natural objects in linear algebra.
Non-examples (illustrating closure failure): the set of unit vectors is not closed under addition, since the sum of two unit vectors generally does not have norm 1; a line that does not pass through the origin is not closed under scalar multiplication, since scaling by 0 gives the zero vector, which is not on that line.
A subspace must be closed under addition and scalar multiplication, and must contain the zero vector. This is more restrictive than an arbitrary set of vectors. Understanding when a set is a subspace (and when it isn't) is crucial for linear algebra.
Linearity of the linear combination operation:
The map that takes coefficients to their linear combination is linear:
If we define $T: \mathbb{R}^k \to \mathbb{R}^n$ by: $$T(\mathbf{c}) = c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k$$
Then $T$ is a linear transformation: $T(\mathbf{c} + \mathbf{d}) = T(\mathbf{c}) + T(\mathbf{d})$ and $T(\alpha \mathbf{c}) = \alpha\, T(\mathbf{c})$ for all coefficient vectors $\mathbf{c}, \mathbf{d} \in \mathbb{R}^k$ and scalars $\alpha$.
This linearity is precisely what makes linear combinations behave so predictably.
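A quick numerical check of both properties, using hypothetical vectors stacked as the columns of a matrix `V`:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 4))   # columns are v1..v4 in R^3

def T(c):
    """Map coefficients c to the linear combination c1*v1 + ... + c4*v4."""
    return V @ c

c = rng.standard_normal(4)
d = rng.standard_normal(4)
alpha = 2.5

print(np.allclose(T(c + d), T(c) + T(d)))       # additivity
print(np.allclose(T(alpha * c), alpha * T(c)))  # homogeneity
```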
```python
import numpy as np

# Demonstrate closure of span
v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])

# Any linear combination of v1, v2 is in the xy-plane (z=0)
lc1 = 3 * v1 + 2 * v2   # [3, 2, 0]
lc2 = -1 * v1 + 4 * v2  # [-1, 4, 0]

print("Closure demonstration:")
print(f"lc1 = {lc1}")
print(f"lc2 = {lc2}")

# Sum of two linear combinations is still in span
lc_sum = lc1 + lc2
print(f"lc1 + lc2 = {lc_sum}")
print(f"z-component is still 0: {lc_sum[2] == 0}")

# Scalar multiple of linear combination is still in span
lc_scaled = 5 * lc1
print(f"5 * lc1 = {lc_scaled}")
print(f"z-component is still 0: {lc_scaled[2] == 0}")

# Non-example: unit vectors are NOT closed
u1 = np.array([1, 0])
u2 = np.array([0, 1])
u_sum = u1 + u2
print(f"\nUnit vector closure failure:")
print(f"||u1|| = {np.linalg.norm(u1)}, ||u2|| = {np.linalg.norm(u2)}")
print(f"u1 + u2 = {u_sum}")
print(f"||u1 + u2|| = {np.linalg.norm(u_sum):.4f} ≠ 1")

# The zero vector is always in the span (all coefficients = 0)
zero = 0 * v1 + 0 * v2
print(f"\nZero vector in span: {zero}")
```

Implementing linear combinations efficiently is crucial for ML performance.
Vectorized computation:
Never compute linear combinations with explicit Python loops. NumPy's vectorized operations are orders of magnitude faster:
```python
# SLOW: explicit loop
result = np.zeros(n)
for v, c in zip(vectors, coeffs):
    result += c * v

# FAST: matrix multiplication
V = np.column_stack(vectors)  # Vectors as columns
result = V @ coeffs
```
The matrix form leverages optimized BLAS routines and parallelism.
When coefficients vary widely in magnitude (e.g., 1e-10 and 1e10), floating-point errors can accumulate. In such cases, consider: (1) scaling/normalizing vectors, (2) using higher precision (float64), (3) sorting terms by magnitude before summing, or (4) using compensated summation algorithms like Kahan summation.
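A tiny, contrived illustration of this failure mode: a small term is swallowed by a huge one during naive left-to-right accumulation, while Python's exactly rounded `math.fsum` recovers it.

```python
import math
import numpy as np

# The true sum of these three terms is 1.0
terms = [1e16, 1.0, -1e16]

naive = 0.0
for t in terms:
    naive += t            # 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost

print(naive)              # 0.0 -> the small term vanished
print(math.fsum(terms))   # 1.0 -> exactly rounded (compensated) summation
print(np.sum(terms))      # also 0.0 for this ordering of terms
```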
Memory considerations: stacking $k$ vectors of dimension $n$ into a single matrix costs $O(kn)$ memory up front, and broadcasting approaches that materialize every scaled vector before summing temporarily hold another matrix of the same size, whereas `V @ coeffs` accumulates the result without storing the scaled copies.
Broadcasting in NumPy:
NumPy's broadcasting rules allow elegant linear combination expressions:
```python
import numpy as np
import time

n_vectors = 100
dim = 1000

# Generate random vectors and coefficients
vectors = [np.random.randn(dim) for _ in range(n_vectors)]
coeffs = np.random.randn(n_vectors)

# Method 1: Explicit loop (SLOW)
start = time.perf_counter()
result_loop = np.zeros(dim)
for v, c in zip(vectors, coeffs):
    result_loop += c * v
loop_time = time.perf_counter() - start

# Method 2: Matrix multiplication (FAST)
V = np.column_stack(vectors)  # Shape: (dim, n_vectors)
start = time.perf_counter()
result_matrix = V @ coeffs
matrix_time = time.perf_counter() - start

print(f"Loop time: {loop_time*1000:.4f} ms")
print(f"Matrix time: {matrix_time*1000:.4f} ms")
print(f"Speedup: {loop_time/matrix_time:.1f}x")
print(f"Results match: {np.allclose(result_loop, result_matrix)}")

# Broadcasting example: weight each vector by its coefficient
# coeffs[np.newaxis, :] broadcasts across columns
V_shaped = np.array(vectors).T               # (dim, n_vectors)
coeffs_shaped = coeffs[np.newaxis, :]        # (1, n_vectors) for broadcasting
weighted_vectors = V_shaped * coeffs_shaped  # Each column scaled
result_broadcast = weighted_vectors.sum(axis=1)
print(f"Broadcast result matches: {np.allclose(result_loop, result_broadcast)}")

# einsum for the same operation
result_einsum = np.einsum('ij,j->i', V, coeffs)
print(f"Einsum result matches: {np.allclose(result_loop, result_einsum)}")
```

Linear combinations are the fundamental building blocks of linear algebra and machine learning. We've covered their definition, geometric meaning, and computational aspects.
What's next:
Now we ask the crucial question: given a set of vectors, what can we reach with linear combinations (the span)? And when do we have enough vectors to reach everything, without redundancy (the concept of linear independence)? These ideas determine when systems of equations have solutions and when transformations are invertible.
You now understand linear combinations—the fundamental operation for building vectors from vectors. This concept is the bridge to span and linear independence, which tell us what vectors can represent and when vectors are 'redundant.'