On the previous page, we learned how to multiply matrices. Now we explore what matrix multiplication tells us. This isn't mere computational technique; it's the key to understanding how neural networks, dimensionality reduction, and countless other ML systems work.

When we multiply matrices, we're not just performing arithmetic. We're asking: What happens when we compose transformations? How do linear operations chain together? What properties survive composition, and what new properties emerge?

This page develops your intuition for reading matrix products as geometric stories.
By the end of this page, you will interpret matrix products as transformation pipelines, understand how composition affects geometric properties (determinant, rank, invertibility), and see neural network forward passes as matrix product chains.
Recall from Page 1: a matrix IS a linear transformation. If $A: \mathbb{R}^n \to \mathbb{R}^m$ and $B: \mathbb{R}^p \to \mathbb{R}^n$, then the product $AB: \mathbb{R}^p \to \mathbb{R}^m$ is the composition.

The dimension flow:

$$\mathbf{x} \in \mathbb{R}^p \xrightarrow{B} \mathbf{u} \in \mathbb{R}^n \xrightarrow{A} \mathbf{y} \in \mathbb{R}^m$$

The product $AB$ compresses this pipeline into a single operation:

$$\mathbf{x} \in \mathbb{R}^p \xrightarrow{AB} \mathbf{y} \in \mathbb{R}^m$$

Key insight: The intermediate dimension $n$ (where $\mathbf{u}$ lives) disappears in the product. The inner dimensions must match for composition to be valid. This isn't an arbitrary rule; it's the requirement that the codomain of $B$ equals the domain of $A$.
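A quick numerical check of this dimension flow, as a minimal NumPy sketch (the dimensions $p = 4$, $n = 3$, $m = 2$ and the random matrices are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, m = 4, 3, 2                        # arbitrary dimensions for illustration

B = rng.standard_normal((n, p))          # B: R^p -> R^n
A = rng.standard_normal((m, n))          # A: R^n -> R^m
x = rng.standard_normal(p)

# Two-step pipeline: first B, then A
u = B @ x                                # u lives in R^n (the intermediate space)
y_step = A @ u                           # y lives in R^m

# Single step: the product AB maps R^p -> R^m directly
AB = A @ B                               # shape (m, p); the inner dimension n disappears
y_prod = AB @ x

print(AB.shape)                          # (2, 4)
print(np.allclose(y_step, y_prod))       # True
```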
Input: 784 features (28×28 image)
Layer 1: 784 → 256 ($W_1$ is $256 \times 784$)
Layer 2: 256 → 128 ($W_2$ is $128 \times 256$)

Product: $W_2 W_1$ is $128 \times 784$, mapping directly from the 784 inputs to 128 features and 'collapsing' the 256-dimensional hidden layer.

Without nonlinear activations between layers, stacking linear layers is equivalent to one linear layer with the product matrix. This is why activations are essential: they break the linearity and allow deep networks to learn complex functions that a single layer cannot represent.
A key fact: the composition of linear transformations is itself a linear transformation, so any sequence of linear layers is equivalent to a single one. If you stack 100 linear layers without activations: $W_{100} W_{99} \cdots W_2 W_1 = W_{\text{combined}}$. This single matrix perfectly replicates the deep network. Nonlinear activations (ReLU, sigmoid, etc.) break this collapsibility, enabling true depth.
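To make the collapse concrete, here is a minimal sketch with random weights (the 784/256/128 sizes echo the example above; the weights themselves are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((256, 784))
W2 = rng.standard_normal((128, 256))
x = rng.standard_normal(784)

# Without activations, the stack collapses to one matrix
W_combined = W2 @ W1                                    # shape (128, 784)
print(np.allclose(W2 @ (W1 @ x), W_combined @ x))       # True

# A nonlinearity in between breaks the collapse
relu = lambda z: np.maximum(z, 0.0)
print(np.allclose(W2 @ relu(W1 @ x), W_combined @ x))   # False (in general)
```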
When transformations compose, their geometric properties combine in predictable ways. Understanding these rules lets you reason about complex transformation chains.
| Property | How It Composes | Formula/Rule |
|---|---|---|
| Determinant | Multiplies | $|AB| = |A| \cdot |B|$ |
| Trace | Does NOT simply compose | $\text{tr}(AB) \neq \text{tr}(A)\text{tr}(B)$ generally |
| Rank | At most min of ranks | $\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))$ |
| Invertibility | Both must be invertible | $(AB)^{-1}$ exists iff $A^{-1}$ and $B^{-1}$ exist |
| Orthogonality | Preserves if both orthogonal | If $A^T A = I$ and $B^T B = I$, then $(AB)^T(AB) = I$ |
| Symmetry | Does NOT preserve | $(AB)^T = B^T A^T \neq AB$ generally |
Determinant multiplication, the key insight:

The determinant measures how a transformation scales areas (2D) or volumes (nD). If $A$ scales area by a factor of 3 and $B$ scales area by a factor of 2, then $AB$ scales area by a factor of 6.

$$|AB| = |A| \cdot |B|$$

Implications:
- If either $|A| = 0$ or $|B| = 0$, then $|AB| = 0$ (the product collapses dimension)
- If both $|A| < 0$ and $|B| < 0$, each factor reverses orientation, but their product $AB$ preserves it (negative × negative = positive)
- Invertibility requires $|AB| \neq 0$, which needs both $|A| \neq 0$ and $|B| \neq 0$
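A quick numerical confirmation of both points, as a minimal NumPy sketch (the matrices are random and the sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Determinants multiply under composition
print(np.isclose(np.linalg.det(A @ B),
                 np.linalg.det(A) * np.linalg.det(B)))   # True

# A singular factor collapses the product: the determinant becomes 0
S = np.diag([1.0, 2.0, 0.0])          # rank-deficient, det(S) = 0
print(np.linalg.det(A @ S))           # ~0 (up to floating-point noise)
```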
In normalizing flows (generative models), we track how transformations change probability density. The formula involves determinants of Jacobian matrices. Composition means determinants multiply—so the total volume change through the flow is the product of per-layer changes. This is why normalizing flows often use triangular matrices: their determinants are just products of diagonal entries, easy to compute.
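As a rough sketch of why triangular structure helps, the snippet below builds two hypothetical lower-triangular Jacobians (random stand-ins, not a real flow) and checks that their log-determinants, each just a sum over the diagonal, add under composition:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-layer Jacobians, lower-triangular so determinants are cheap
J1 = np.tril(rng.standard_normal((4, 4))) + 2 * np.eye(4)
J2 = np.tril(rng.standard_normal((4, 4))) + 2 * np.eye(4)

# log|det| of a triangular matrix = sum of log|diagonal entries|
logdet_1 = np.sum(np.log(np.abs(np.diag(J1))))
logdet_2 = np.sum(np.log(np.abs(np.diag(J2))))

# Determinants multiply under composition, so log-determinants add
_, logdet_total = np.linalg.slogdet(J2 @ J1)
print(np.isclose(logdet_total, logdet_1 + logdet_2))   # True
```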
When you see a product like $ABC\mathbf{x}$, read it right-to-left to understand the sequence of operations:

$$ABC\mathbf{x} = A(B(C\mathbf{x}))$$

1. First, $C$ transforms $\mathbf{x}$
2. Then, $B$ transforms the result
3. Finally, $A$ transforms that result

This is because function composition works inside-out, and matrix products are designed to match this.
Consider three transformations: a scaling by 2 ($S$), a 45° rotation ($R$), and a reflection across the x-axis ($F$):

$$S = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}, \quad R = \begin{bmatrix} 0.707 & -0.707 \\ 0.707 & 0.707 \end{bmatrix}, \quad F = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

The combined transformation is $FRS$.

Apply it to the vector $(1, 0)$:
1. $S(1,0) = (2, 0)$
2. $R(2, 0) = (1.414, 1.414)$
3. $F(1.414, 1.414) = (1.414, -1.414)$

Reading the product $FRS$ right-to-left matches the operation order: scale first ($S$), rotate second ($R$), reflect last ($F$). The matrix $FRS$ encapsulates all three steps in one multiplication.
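The same computation in NumPy, as a small sketch (the 0.707 entries are written exactly as $\sqrt{2}/2$):

```python
import numpy as np

S = np.array([[2.0, 0.0], [0.0, 2.0]])     # scale by 2
c = np.sqrt(2) / 2                          # cos(45°) = sin(45°) ≈ 0.707
R = np.array([[c, -c], [c, c]])             # rotate 45° counterclockwise
F = np.array([[1.0, 0.0], [0.0, -1.0]])     # reflect across the x-axis

x = np.array([1.0, 0.0])

# Step by step, right to left
print(S @ x)               # [2. 0.]
print(R @ (S @ x))         # ≈ [1.414 1.414]
print(F @ (R @ (S @ x)))   # ≈ [ 1.414 -1.414]

# The single combined matrix gives the same result
print((F @ R @ S) @ x)     # ≈ [ 1.414 -1.414]
```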
Some fields (especially computer graphics) use row vectors instead of column vectors, writing $\mathbf{x}^T A$ instead of $A\mathbf{x}$. In that convention, transformation order is left-to-right. Always check which convention is being used. In ML and most mathematical texts, column vectors are standard, so right-to-left applies.
Matrix products have another profound interpretation: change of basis. When you transform a matrix by conjugation ($P^{-1}AP$), you're viewing the same transformation $A$ from a different coordinate system.

The setup:

- $A$ is a transformation expressed in the standard basis
- $P$ is a matrix whose columns are a new basis $\{\mathbf{p}_1, \mathbf{p}_2, ...\}$
- $P^{-1}AP$ is the same transformation expressed in the new basis

Why this matters:

Some transformations look complicated in one basis but simple in another. The goal of diagonalization (finding eigenvalues) is to find a basis where the transformation becomes pure scaling, i.e., a diagonal matrix.
Step-by-step interpretation of $\mathbf{y} = P^{-1}AP\mathbf{x}$, reading right-to-left:

1. $P\mathbf{x}$: takes coordinates in the $P$-basis and converts them to standard coordinates
2. $A(P\mathbf{x})$: applies the transformation in the standard basis
3. $P^{-1}(AP\mathbf{x})$: converts the result back to $P$-basis coordinates

The beautiful result: if $P$ consists of eigenvectors of $A$, then $P^{-1}AP = D$, where $D$ is diagonal with the eigenvalues on its diagonal. The transformation, in its natural basis, is just scaling.
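A concrete check with a small symmetric matrix, as a minimal sketch (the matrix is an arbitrary diagonalizable example; the eigenvalue order returned by NumPy may vary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # an arbitrary diagonalizable example

eigvals, P = np.linalg.eig(A)        # columns of P are eigenvectors of A
D = np.linalg.inv(P) @ A @ P         # same transformation, eigenvector coordinates

print(np.round(D, 10))               # diagonal, with eigenvalues 3 and 1 (order may vary)
print(eigvals)
```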
Principal Component Analysis finds a basis (the principal components) where the covariance matrix becomes diagonal. In this new coordinate system, features are uncorrelated and ordered by variance. The 'principal components' ARE the eigenvectors of the covariance matrix. PCA literally changes to the natural basis of your data's spread.
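To see this in code, the sketch below generates toy correlated data, diagonalizes its covariance with an eigendecomposition, and confirms the projected features are uncorrelated (the data, mixing matrix, and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy correlated 2-D data (a stand-in for a real dataset)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)          # covariance in the original basis
eigvals, P = np.linalg.eigh(C)       # columns of P = principal components

# Express the data in the eigenvector basis: covariance becomes diagonal
Z = X @ P
print(np.round(np.cov(Z, rowvar=False), 6))   # ~diagonal, entries = eigvals
print(np.round(eigvals, 6))
```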
$$A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

This shears horizontally. What are its eigenvalues?

Eigenvalues: both equal 1 (a repeated eigenvalue). But the repeated eigenvalue has only a one-dimensional eigenspace (spanned by $(1, 0)$), so there is no basis of eigenvectors and $A$ is NOT diagonalizable. Its Jordan form is $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ (it's already in Jordan canonical form!).

Not every matrix can be diagonalized; some transformations have intrinsic 'shear-like' behavior that no basis change can remove. Jordan form is the closest we can get for non-diagonalizable matrices.
The forward pass of a feedforward neural network is fundamentally a sequence of matrix multiplications interleaved with nonlinear activations:

$$\mathbf{h}_1 = \sigma(W_1 \mathbf{x} + \mathbf{b}_1)$$
$$\mathbf{h}_2 = \sigma(W_2 \mathbf{h}_1 + \mathbf{b}_2)$$
$$\vdots$$
$$\mathbf{y} = W_L \mathbf{h}_{L-1} + \mathbf{b}_L$$

The linear part (ignoring biases and activations momentarily):

$$\mathbf{y} = W_L W_{L-1} \cdots W_2 W_1 \mathbf{x}$$

This is a product of many matrices! Each weight matrix transforms the representation, changing dimension, rotating, scaling, and projecting.
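A minimal sketch of such a forward pass in NumPy (the layer sizes 784 → 256 → 128 → 10 and the random weights are illustrative assumptions, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical layer sizes: 784 -> 256 -> 128 -> 10
W1, b1 = 0.01 * rng.standard_normal((256, 784)), np.zeros(256)
W2, b2 = 0.01 * rng.standard_normal((128, 256)), np.zeros(128)
W3, b3 = 0.01 * rng.standard_normal((10, 128)), np.zeros(10)

x = rng.standard_normal(784)

# Forward pass: matrix multiplications interleaved with nonlinearities
h1 = relu(W1 @ x + b1)
h2 = relu(W2 @ h1 + b2)
y = W3 @ h2 + b3
print(y.shape)    # (10,)
```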
A beautifully geometric view: each layer of a neural network folds, stretches, and warps the input space. The goal is to find a sequence of transformations that 'unfolds' the data into a configuration where classes are linearly separable. The network literally learns to warp space until the problem becomes easy.
Why depth matters (with activations):

Without activations, $W_L \cdots W_1$ collapses to a single matrix. With activations, each layer can create folds and bends that a single linear transform cannot.

Consider classifying two interleaved spirals. No linear transformation can separate them. But a deep network can:
1. First layers: detect local curve orientations
2. Middle layers: identify spiral arm membership
3. Final layers: map arms to class labels

Each matrix transforms the representation, and each nonlinearity enables new folding. The matrix product chain, punctuated by nonlinearities, is the essence of deep learning's power.
Matrix multiplication is associative: $(AB)C = A(BC)$. Both give the same result, but computational costs can differ dramatically. This matters enormously in ML, where we often multiply chains of matrices with different shapes.
$A$: $10 \times 100$
$B$: $100 \times 5$
$C$: $5 \times 50$
$D$: $50 \times 1$

Different orderings have vastly different costs (multiplying a $p \times q$ matrix by a $q \times r$ matrix costs about $p \cdot q \cdot r$ operations):

- $((AB)C)D$: $10 \cdot 100 \cdot 5 + 10 \cdot 5 \cdot 50 + 10 \cdot 50 \cdot 1 = 5000 + 2500 + 500 = 8000$ ops
- $(A(BC))D$: $100 \cdot 5 \cdot 50 + 10 \cdot 100 \cdot 50 + 10 \cdot 50 \cdot 1 = 25000 + 50000 + 500 = 75500$ ops
- $A(B(CD))$: $5 \cdot 50 \cdot 1 + 100 \cdot 5 \cdot 1 + 10 \cdot 100 \cdot 1 = 250 + 500 + 1000 = 1750$ ops

The best order, $A(B(CD))$, is over 40× faster than the worst! Dynamic programming finds the optimal parenthesization in $O(n^3)$ time for $n$ matrices. Deep learning frameworks do this automatically when possible.
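The sketch below checks that all three orderings produce the same matrix and recomputes the operation counts (using the standard rough cost model of $p \cdot q \cdot r$ operations per dense multiplication):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((10, 100))
B = rng.standard_normal((100, 5))
C = rng.standard_normal((5, 50))
D = rng.standard_normal((50, 1))

# Associativity: every parenthesization gives the same matrix...
P1 = ((A @ B) @ C) @ D
P2 = (A @ (B @ C)) @ D
P3 = A @ (B @ (C @ D))
print(np.allclose(P1, P2) and np.allclose(P2, P3))   # True

# ...but not at the same cost: (p x q) times (q x r) costs ~p*q*r operations
def cost(p, q, r):
    return p * q * r

print(cost(10, 100, 5) + cost(10, 5, 50) + cost(10, 50, 1))    # 8000
print(cost(100, 5, 50) + cost(10, 100, 50) + cost(10, 50, 1))  # 75500
print(cost(5, 50, 1) + cost(100, 5, 1) + cost(10, 100, 1))     # 1750
```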
When computing a chain ending with a vector ($M_n...M_2 M_1 \mathbf{x}$), always multiply right-to-left: $M_n(...(M_2(M_1 \mathbf{x})))$. Each intermediate result is a vector, keeping dimensions small. Never form the full matrix product first—that's almost always slower.
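A rough timing sketch of the difference (exact numbers depend on your hardware and BLAS library; the $1000 \times 1000$ sizes are arbitrary):

```python
import time
import numpy as np

rng = np.random.default_rng(7)
M1 = rng.standard_normal((1000, 1000))
M2 = rng.standard_normal((1000, 1000))
x = rng.standard_normal(1000)

# Right-to-left: two matrix-vector products, ~2 million multiply-adds each
t0 = time.perf_counter()
y_fast = M2 @ (M1 @ x)
t_fast = time.perf_counter() - t0

# Forming the full matrix product first: ~1 billion multiply-adds
t0 = time.perf_counter()
y_slow = (M2 @ M1) @ x
t_slow = time.perf_counter() - t0

print(np.allclose(y_fast, y_slow))                          # True
print(f"right-to-left: {t_fast:.4f}s, matrix-first: {t_slow:.4f}s")
```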
Certain matrix types have special multiplication properties that simplify computation and reveal structure.
Diagonal matrices:

$$D = \begin{bmatrix} d_1 & 0 & \cdots \\ 0 & d_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

Multiplication is easy: $DA$ scales the rows of $A$ by the diagonal entries; $AD$ scales the columns of $A$.

Product of diagonals: $D_1 D_2$ is diagonal with entries $d_{1i} \cdot d_{2i}$. Just multiply corresponding diagonal entries!

Inverse: $D^{-1}$ has entries $1/d_i$, trivial to compute if no diagonal entry is zero.

ML use: batch normalization scaling, attention score weighting, per-feature learning rates.
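These properties are easy to verify numerically; here is a minimal sketch with arbitrary entries:

```python
import numpy as np

d = np.array([2.0, 3.0, 0.5])
D = np.diag(d)
A = np.arange(9, dtype=float).reshape(3, 3)

# D @ A scales the rows of A; A @ D scales its columns
print(np.allclose(D @ A, d[:, None] * A))    # True
print(np.allclose(A @ D, A * d[None, :]))    # True

# Product of diagonal matrices: multiply matching diagonal entries
D2 = np.diag([1.0, -1.0, 4.0])
print(np.allclose(D @ D2, np.diag(d * np.diag(D2))))     # True

# Inverse of a diagonal matrix: reciprocal of each entry (if none are zero)
print(np.allclose(np.linalg.inv(D), np.diag(1.0 / d)))   # True
```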
We've developed deep intuition for interpreting matrix multiplication as transformation composition: products are pipelines of transformations, determinants multiply, rank can only shrink, and a change of basis can reveal a transformation's simplest form.
What's next:

We've mastered how matrices transform space and compose. The next page explores the critical concepts of rank and nullity: what they tell us about a transformation's dimension-collapsing behavior, and why this matters for understanding when linear systems have solutions and how ML models can fail.
You now interpret matrix products geometrically and understand their role in ML systems. Matrix multiplication isn't just arithmetic—it's the algebra of transformation pipelines. Next: rank, nullity, and the fundamental theorem of linear algebra.