When you first encounter matrices in mathematics, they often appear as rectangular grids of numbers—notation to be manipulated according to seemingly arbitrary rules. You learn to multiply matrices using the 'row-by-column' procedure without ever understanding why that procedure exists or what it means.

This superficial understanding is a barrier to mastering machine learning. In ML, matrices are everywhere: weight matrices in neural networks, covariance matrices in statistics, transformation matrices in data preprocessing, kernel matrices in support vector machines. Without geometric intuition, these are just symbols to shuffle around. With it, deep patterns emerge that make complex algorithms feel natural.

The central insight: A matrix is not a table of numbers. A matrix is a complete description of a linear transformation—a function that takes vectors as input and produces transformed vectors as output, preserving the essential structure of the space.
By the end of this page, you will understand matrices as functions that transform space. You will see how every matrix encodes a specific geometric operation—stretching, rotating, reflecting, shearing, or projecting—and why this perspective is essential for comprehending machine learning algorithms at their deepest level.
Let's begin with a mental reset. Forget everything you know about matrix mechanics. Instead, consider the simplest possible question: What does a matrix do?

A matrix A is a function that takes a vector x as input and produces a new vector y as output:

$$\mathbf{y} = A\mathbf{x}$$

That's it. Everything else—all the rules for matrix multiplication, the definitions of determinants and eigenvalues, the entire apparatus of linear algebra—follows from understanding this one idea deeply.

The function analogy:

Just as the function $f(x) = 2x$ takes a number and doubles it, a matrix takes a vector and transforms it according to a specific rule. The difference is that vectors inhabit multi-dimensional space, so the transformation is richer—it can affect direction as well as magnitude.

Consider a 2×2 matrix acting on 2D vectors:

$$A = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$A\mathbf{x} = \begin{bmatrix} 2 \cdot 1 + 0 \cdot 1 \\ 0 \cdot 1 + 3 \cdot 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$$

The input vector $(1, 1)$ becomes $(2, 3)$. The matrix stretched the x-component by factor 2 and the y-component by factor 3. This is scaling—one of the fundamental transformation types.
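If you want to try this yourself, here is a minimal NumPy sketch of the example above (the matrix and vector values are taken directly from it):

```python
import numpy as np

# The scaling matrix and input vector from the example above
A = np.array([[2, 0],
              [0, 3]])
x = np.array([1, 1])

y = A @ x  # matrix-vector product: apply the transformation to x
print(y)   # [2 3]  (x stretched by 2 along x and by 3 along y)
```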
In a neural network, each layer performs exactly this operation: multiply input vector by weight matrix, producing output vector. The weight matrix IS the learned transformation. Understanding matrices as transformations means understanding what neural networks are actually doing—they're learning to warp input space until the problem becomes linearly separable.
Not every function on vectors is a linear transformation. The term "linear" imposes two strict constraints that together define the essence of what matrices can represent.

A function $T: \mathbb{R}^n \to \mathbb{R}^m$ is a linear transformation if and only if:

1. Additivity (Preservation of Addition):
   $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$

2. Homogeneity (Preservation of Scalar Multiplication):
   $T(c\mathbf{u}) = cT(\mathbf{u})$

These can be combined into a single condition:
$$T(c_1\mathbf{u} + c_2\mathbf{v}) = c_1 T(\mathbf{u}) + c_2 T(\mathbf{v})$$

This says: the transformation of a linear combination equals the linear combination of the transformations. In prose: linear transformations preserve the structure of vector spaces.
These algebraic conditions have profound geometric consequences. Any linear transformation must: (1) map the origin to the origin, (2) map straight lines to straight lines (or points), (3) preserve parallelism—if lines are parallel before transformation, they remain parallel after. No curving, no bending, no translation.
Examples of linear transformations:
- Rotation around the origin
- Scaling (uniform or non-uniform)
- Reflection across a line through the origin
- Shearing
- Projection onto a subspace

Examples of NON-linear transformations:
- Translation: $T(\mathbf{x}) = \mathbf{x} + \mathbf{b}$ — violates $T(\mathbf{0}) = \mathbf{0}$
- Affine: $T(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ — linear part plus translation
- Any function involving powers, products of components, etc.

The fundamental theorem (informal):

Every linear transformation between finite-dimensional vector spaces can be represented by a matrix, and every matrix represents a linear transformation. The matrix IS the transformation, encoded in a specific way.
| Transformation | Formula | Linear? | Why |
|---|---|---|---|
| Scaling | $T(\mathbf{x}) = 2\mathbf{x}$ | ✅ Yes | Both properties satisfied |
| Rotation | $T(\mathbf{x}) = R\mathbf{x}$ (R = rotation matrix) | ✅ Yes | Preserves structure |
| Translation | $T(\mathbf{x}) = \mathbf{x} + (1, 2)$ | ❌ No | $T(\mathbf{0}) \neq \mathbf{0}$ |
| Squaring | $T(x, y) = (x^2, y^2)$ | ❌ No | $T(2\mathbf{x}) \neq 2T(\mathbf{x})$ |
| ReLU | $T(x) = \max(0, x)$ | ❌ No | Not additive |
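To make the two conditions concrete, here is a small numerical check (a sketch; the test vectors and scalar are arbitrary choices): a rotation matrix satisfies additivity and homogeneity, while ReLU fails additivity, matching the table above.

```python
import numpy as np

u, v, c = np.array([1.0, -2.0]), np.array([3.0, 0.5]), 2.5  # arbitrary test inputs

# Rotation by 90 degrees counterclockwise: a linear transformation
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
print(np.allclose(R @ (u + v), R @ u + R @ v))  # True  (additivity holds)
print(np.allclose(R @ (c * u), c * (R @ u)))    # True  (homogeneity holds)

# ReLU: not a linear transformation
relu = lambda x: np.maximum(0, x)
print(np.allclose(relu(u + v), relu(u) + relu(v)))  # False (additivity fails)
```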
Here's the key insight that connects algebra to geometry: the columns of a matrix tell you where the standard basis vectors go.

In 2D, the standard basis vectors are:
$$\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

These are the unit vectors pointing along the x-axis and y-axis respectively.

When you apply a matrix $A$ to these basis vectors:
$$A\mathbf{e}_1 = \text{first column of } A$$
$$A\mathbf{e}_2 = \text{second column of } A$$

This is profound. The entire behavior of the transformation is determined by what it does to the basis vectors. Since any vector can be written as a linear combination of basis vectors, and linear transformations preserve linear combinations, knowing what happens to the basis tells you what happens to everything.
Example: Rotation by 90° counterclockwise

$$R_{90} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$

Column 1: $(0, 1)$ — where $\mathbf{e}_1$ goes after rotation
Column 2: $(-1, 0)$ — where $\mathbf{e}_2$ goes after rotation

The first column tells us: a vector pointing right (1, 0) gets rotated to point up (0, 1). The second column tells us: a vector pointing up (0, 1) gets rotated to point left (-1, 0). This IS a 90° counterclockwise rotation—you can visualize it geometrically.
Constructing transformation matrices:

This principle works in reverse. To build a matrix for any linear transformation:

1. Determine what the transformation does to each standard basis vector
2. Place those destination vectors as columns of the matrix
3. Done—you have the matrix representation

Example: Reflection across the x-axis
- $\mathbf{e}_1 = (1, 0)$ stays at $(1, 0)$ (on the axis, no change)
- $\mathbf{e}_2 = (0, 1)$ goes to $(0, -1)$ (flipped below the axis)

$$\text{Reflection matrix} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

Example: Scaling by factor 3 in x, factor 2 in y
- $\mathbf{e}_1 = (1, 0)$ goes to $(3, 0)$
- $\mathbf{e}_2 = (0, 1)$ goes to $(0, 2)$

$$\text{Scaling matrix} = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}$$
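As a quick sanity check (a sketch built from the reflection example above): assemble the matrix column by column from the basis-vector destinations, then confirm that multiplying by $\mathbf{e}_1$ and $\mathbf{e}_2$ reads off those columns.

```python
import numpy as np

e1, e2 = np.array([1, 0]), np.array([0, 1])  # standard basis vectors in 2D

# Reflection across the x-axis: e1 stays at (1, 0), e2 goes to (0, -1).
# Stack the destination vectors as the columns of the matrix.
reflection = np.column_stack([[1, 0], [0, -1]])

print(reflection @ e1)                # [ 1  0]  -> first column of the matrix
print(reflection @ e2)                # [ 0 -1]  -> second column of the matrix
print(reflection @ np.array([3, 4]))  # [ 3 -4]: any vector gets flipped below the x-axis
```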
Whenever you see a matrix, visualize a grid being transformed. The columns show where the grid lines emanating from the origin end up. This mental image makes matrix operations intuitive—you stop seeing arithmetic and start seeing geometry.
Every 2D linear transformation can be built by combining a handful of fundamental types. Understanding each type geometrically makes complex transformations decomposable and intuitive.
Scaling (Dilation/Contraction)

Scaling stretches or compresses space along the coordinate axes.

$$S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}$$

- $s_x > 1$: stretch horizontally
- $0 < s_x < 1$: compress horizontally
- $s_x < 0$: flip and scale horizontally
- $s_x = s_y$: uniform scaling (preserves shape)

ML Applications:
- Feature normalization (scaling inputs to similar ranges; see the sketch below)
- Weight initialization in neural networks
- Covariance matrix diagonal represents per-feature variance
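To illustrate the first ML application (a sketch with made-up numbers, not from the original text): standardizing features amounts to multiplying each centered data point by a diagonal scaling matrix whose entries are reciprocal standard deviations.

```python
import numpy as np

# Toy dataset: 4 samples, 2 features on very different scales (made-up values)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Per-feature standard deviations define a diagonal scaling matrix
S = np.diag(1.0 / X.std(axis=0))

X_scaled = (X - X.mean(axis=0)) @ S  # center, then scale each feature to unit variance
print(X_scaled.std(axis=0))          # [1. 1.]: both features now on a comparable scale
```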
Here's where the matrix perspective becomes powerful: composing transformations corresponds to multiplying matrices.

If transformation $A$ is applied first, then transformation $B$, the combined effect is:
$$\mathbf{y} = B(A\mathbf{x}) = (BA)\mathbf{x}$$

The matrix $BA$ encapsulates both transformations in one. This is why matrix multiplication is defined the way it is—it's designed to make function composition work correctly.

Order matters:

Matrix multiplication is NOT commutative: $AB \neq BA$ in general. This reflects the geometric reality that rotation then scaling differs from scaling then rotation.
Example: Scaling vs. rotation order

$$S = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \quad R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$

Scale then rotate: $RS = \begin{bmatrix} 0 & -1 \\ 2 & 0 \end{bmatrix}$
Rotate then scale: $SR = \begin{bmatrix} 0 & -2 \\ 1 & 0 \end{bmatrix}$

The results are different! 'Scale then rotate' stretches horizontally first, then rotates. 'Rotate then scale' rotates first, so the horizontal stretch affects what was originally the y-direction. Always read matrix products right-to-left for the order of transformations.
In the expression $ABC\mathbf{x}$, the transformations apply from right to left: first $C$, then $B$, then $A$. This is opposite to how we read English but matches how function composition works: $f(g(h(x)))$ means $h$ first, then $g$, then $f$.
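A brief NumPy check of the worked example above (values taken directly from it): composing by multiplying matrices gives the same result as applying the transformations one after another, and the two orders genuinely differ.

```python
import numpy as np

S = np.array([[2, 0], [0, 1]])   # stretch x by 2
R = np.array([[0, -1], [1, 0]])  # rotate 90 degrees counterclockwise
x = np.array([1, 1])

print(R @ (S @ x), (R @ S) @ x)  # [-1  2] [-1  2]: composition = product of matrices
print(R @ S)                     # [[ 0 -1] [ 2  0]]  scale then rotate
print(S @ R)                     # [[ 0 -2] [ 1  0]]  rotate then scale (different!)
```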
The power of composition:

Complex transformations can be decomposed into sequences of simple ones. The Singular Value Decomposition (SVD), which we'll study later, shows that ANY matrix can be written as:
$$A = U \Sigma V^T$$

This means: any linear transformation is equivalent to a rotation ($V^T$), followed by a scaling ($\Sigma$), followed by another rotation ($U$). This decomposition is foundational for understanding everything from PCA to matrix compression to the numerics of neural network training.
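As a preview (a sketch only; the matrix here is an arbitrary example, and SVD is covered properly later): NumPy can compute the three factors, and multiplying them back together recovers the original matrix.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # arbitrary example matrix

U, s, Vt = np.linalg.svd(A)           # s holds the singular values (diagonal of Sigma)
Sigma = np.diag(s)

print(np.allclose(U @ Sigma @ Vt, A)) # True: rotation, then scaling, then rotation
```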
So far, we've focused on square matrices that transform n-dimensional space to n-dimensional space. But matrices can also change the dimension of vectors.

An $m \times n$ matrix transforms n-dimensional vectors into m-dimensional vectors:
$$A: \mathbb{R}^n \to \mathbb{R}^m$$

$$A_{m \times n} \cdot \mathbf{x}_{n \times 1} = \mathbf{y}_{m \times 1}$$
| Matrix Shape | Transformation Type | ML Example |
|---|---|---|
| $m > n$ (more rows) | Maps to higher dimension | Lifting to feature space, embedding layers |
| $m < n$ (fewer rows) | Maps to lower dimension | Dimensionality reduction, pooling, compression |
| $m = n$ (square) | Same dimension | Rotations, reflections in latent space |
Dimension reduction example:

A $2 \times 3$ matrix projects 3D vectors onto a 2D plane:

$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix}$$

$$A\mathbf{x} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$

The z-component is discarded—this is projection onto the xy-plane.

Dimension increase example:

A $3 \times 2$ matrix lifts 2D vectors into 3D:

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$

$$A\mathbf{x} = \begin{bmatrix} 3 \\ 4 \\ 0 \end{bmatrix}$$

The 2D vector is embedded in 3D space, lying in the xy-plane.
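The two examples above in a minimal NumPy sketch (values taken from them), showing how the matrix shape alone dictates the input and output dimensions:

```python
import numpy as np

project = np.array([[1, 0, 0],
                    [0, 1, 0]])        # 2x3: maps R^3 -> R^2
embed = np.array([[1, 0],
                  [0, 1],
                  [0, 0]])             # 3x2: maps R^2 -> R^3

print(project @ np.array([3, 4, 5]))   # [3 4]    z-component discarded
print(embed @ np.array([3, 4]))        # [3 4 0]  2D vector placed in the xy-plane of R^3
```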
Each fully-connected layer in a neural network uses a weight matrix to transform the input dimension. A layer going from 784 inputs (28×28 image) to 256 hidden units uses a 256×784 matrix—reducing dimension. A layer going from 256 to 512 uses a 512×256 matrix—increasing dimension. The architecture literally defines a sequence of dimension-changing transformations.
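A hedged sketch of those shapes (random weights, no bias terms or nonlinearities, purely to show how matrix shapes chain together):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(784)          # flattened 28x28 image

W1 = rng.standard_normal((256, 784))  # 256x784: R^784 -> R^256 (dimension reduction)
W2 = rng.standard_normal((512, 256))  # 512x256: R^256 -> R^512 (dimension increase)

h = W1 @ x                            # hidden representation, shape (256,)
out = W2 @ h                          # next layer's input, shape (512,)
print(h.shape, out.shape)             # (256,) (512,)
```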
Two transformations deserve special attention for their role as 'boundary cases' in the space of all linear transformations: the identity matrix $I$, which maps every vector to itself ($I\mathbf{x} = \mathbf{x}$), and the zero matrix, which collapses every vector to the origin.
The spectrum between identity and zero:

Think of all possible matrices as sitting on a spectrum. At one extreme is the identity—preserving everything. At the other extreme is the zero matrix—destroying everything. In between are the interesting transformations: rotations, scalings, projections, shears, and their combinations.

Eigendecomposition preview:

We'll later see that eigenvalues tell us where a matrix sits on this spectrum along different directions. Eigenvalue of 1 means that direction is preserved (identity-like). Eigenvalue of 0 means that direction is destroyed (zero-like). Eigenvalues between 0 and 1 mean contraction; greater than 1 mean expansion.
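A tiny preview in NumPy (the matrix is an arbitrary diagonal example; eigendecomposition is covered later): for a diagonal scaling matrix, the eigenvalues read off exactly how much each axis direction is expanded, preserved, or destroyed.

```python
import numpy as np

# Diagonal matrix: stretch x by 3, leave y unchanged, flatten z to zero
A = np.diag([3.0, 1.0, 0.0])

print(np.linalg.eigvals(A))  # eigenvalues 3, 1, 0: expansion, preserved, destroyed directions
```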
We've developed a geometric understanding of matrices that goes far beyond computational mechanics. Let's consolidate the key insights:

- A matrix is a linear transformation: a function that maps vectors to vectors while preserving addition and scalar multiplication.
- The columns of a matrix are the images of the standard basis vectors; they determine the entire transformation.
- Composing transformations corresponds to multiplying matrices, applied right-to-left, and order matters.
- The shape of a matrix ($m \times n$) fixes the input and output dimensions, so non-square matrices change dimension.
- The identity and zero matrices are the boundary cases; eigenvalues (coming later) locate a transformation between them, direction by direction.
What's next:

Now that we understand what matrices ARE conceptually, we'll develop fluency with the operations we can perform on them. The next page covers matrix arithmetic—addition, scalar multiplication, and the matrix product—with the geometric perspective always in mind.
You now understand matrices as linear transformations—functions that move, stretch, rotate, and reshape space while preserving its linear structure. This geometric foundation will make every subsequent matrix topic clearer and more intuitive. Next: matrix operations and the geometry of matrix multiplication.