On the previous page, we learned how to multiply matrices. Now we explore what matrix multiplication tells us. This isn't mere computational technique; it's the key to understanding how neural networks, dimensionality reduction, and countless other ML systems work.

When we multiply matrices, we're not just performing arithmetic. We're asking: What happens when we compose transformations? How do linear operations chain together? What properties survive composition, and what new properties emerge?

This page develops your intuition for reading matrix products as geometric stories.
By the end of this page, you will interpret matrix products as transformation pipelines, understand how composition affects geometric properties (determinant, rank, invertibility), and see neural network forward passes as matrix product chains.
Recall from Page 1: a matrix IS a linear transformation. If $A: \mathbb{R}^n \to \mathbb{R}^m$ and $B: \mathbb{R}^p \to \mathbb{R}^n$, then the product $AB: \mathbb{R}^p \to \mathbb{R}^m$ is the composition.

The dimension flow:

$$\mathbf{x} \in \mathbb{R}^p \xrightarrow{B} \mathbf{u} \in \mathbb{R}^n \xrightarrow{A} \mathbf{y} \in \mathbb{R}^m$$

The product $AB$ compresses this pipeline into a single operation:

$$\mathbf{x} \in \mathbb{R}^p \xrightarrow{AB} \mathbf{y} \in \mathbb{R}^m$$

Key insight: The intermediate dimension $n$ (where $\mathbf{u}$ lives) disappears in the product. The inner dimensions must match for composition to be valid. This isn't an arbitrary rule; it's the requirement that the codomain of $B$ equals the domain of $A$.
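A quick numerical check of this dimension flow, as a minimal NumPy sketch (the dimensions $p = 4$, $n = 3$, $m = 2$ and the random matrices are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, m = 4, 3, 2                        # arbitrary dimensions for illustration

B = rng.standard_normal((n, p))          # B: R^p -> R^n
A = rng.standard_normal((m, n))          # A: R^n -> R^m
x = rng.standard_normal(p)

# Two-step pipeline: first B, then A
u = B @ x                                # u lives in R^n (the intermediate space)
y_step = A @ u                           # y lives in R^m

# Single step: the product AB maps R^p -> R^m directly
AB = A @ B                               # shape (m, p); the inner dimension n disappears
y_prod = AB @ x

print(AB.shape)                          # (2, 4)
print(np.allclose(y_step, y_prod))       # True
```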
Input: 784 features (28×28 image)
Layer 1: 784 → 256 ($W_1$ is $256 \times 784$)
Layer 2: 256 → 128 ($W_2$ is $128 \times 256$)

Product: $W_2 W_1$ is $128 \times 784$, mapping directly from the 784 inputs to 128 features and 'collapsing' the 256-dimensional hidden layer.

Without nonlinear activations between layers, stacking linear layers is equivalent to one linear layer with the product matrix. This is why activations are essential: they break the linearity and allow deep networks to learn complex functions that a single layer cannot represent.
A key fact: the composition of linear transformations is itself a linear transformation, so any sequence of linear layers is equivalent to a single one. If you stack 100 linear layers without activations: $W_{100} W_{99} \cdots W_2 W_1 = W_{\text{combined}}$. This single matrix perfectly replicates the deep network. Nonlinear activations (ReLU, sigmoid, etc.) break this collapsibility, enabling true depth.
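To make the collapse concrete, here is a minimal sketch with random weights (the 784/256/128 sizes echo the example above; the weights themselves are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((256, 784))
W2 = rng.standard_normal((128, 256))
x = rng.standard_normal(784)

# Without activations, the stack collapses to one matrix
W_combined = W2 @ W1                                    # shape (128, 784)
print(np.allclose(W2 @ (W1 @ x), W_combined @ x))       # True

# A nonlinearity in between breaks the collapse
relu = lambda z: np.maximum(z, 0.0)
print(np.allclose(W2 @ relu(W1 @ x), W_combined @ x))   # False (in general)
```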
When transformations compose, their geometric properties combine in predictable ways. Understanding these rules lets you reason about complex transformation chains.
| Property | How It Composes | Formula/Rule |
|---|---|---|
| Determinant | Multiplies | $|AB| = |A| \cdot |B|$ |
| Trace | Does NOT simply compose | $\text{tr}(AB) \neq \text{tr}(A)\text{tr}(B)$ generally |
| Rank | At most min of ranks | $\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))$ |
| Invertibility | Both must be invertible | $(AB)^{-1}$ exists iff $A^{-1}$ and $B^{-1}$ exist |
| Orthogonality | Preserves if both orthogonal | If $A^T A = I$ and $B^T B = I$, then $(AB)^T(AB) = I$ |
| Symmetry | Does NOT preserve | $(AB)^T = B^T A^T \neq AB$ generally |
Determinant multiplication, the key insight:

The determinant measures how a transformation scales areas (2D) or volumes (nD). If $A$ scales area by a factor of 3 and $B$ scales area by a factor of 2, then $AB$ scales area by a factor of 6.

$$|AB| = |A| \cdot |B|$$

Implications:
- If either $|A| = 0$ or $|B| = 0$, then $|AB| = 0$ (the product collapses dimension)
- If both $|A| < 0$ and $|B| < 0$, each factor reverses orientation, but their product $AB$ preserves it (negative × negative = positive)
- Invertibility requires $|AB| \neq 0$, which needs both $|A| \neq 0$ and $|B| \neq 0$
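A quick numerical confirmation of both points, as a minimal NumPy sketch (the matrices are random and the sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Determinants multiply under composition
print(np.isclose(np.linalg.det(A @ B),
                 np.linalg.det(A) * np.linalg.det(B)))   # True

# A singular factor collapses the product: the determinant becomes 0
S = np.diag([1.0, 2.0, 0.0])          # rank-deficient, det(S) = 0
print(np.linalg.det(A @ S))           # ~0 (up to floating-point noise)
```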
In normalizing flows (generative models), we track how transformations change probability density. The formula involves determinants of Jacobian matrices. Composition means determinants multiply—so the total volume change through the flow is the product of per-layer changes. This is why normalizing flows often use triangular matrices: their determinants are just products of diagonal entries, easy to compute.
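As a rough sketch of why triangular structure helps, the snippet below builds two hypothetical lower-triangular Jacobians (random stand-ins, not a real flow) and checks that their log-determinants, each just a sum over the diagonal, add under composition:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-layer Jacobians, lower-triangular so determinants are cheap
J1 = np.tril(rng.standard_normal((4, 4))) + 2 * np.eye(4)
J2 = np.tril(rng.standard_normal((4, 4))) + 2 * np.eye(4)

# log|det| of a triangular matrix = sum of log|diagonal entries|
logdet_1 = np.sum(np.log(np.abs(np.diag(J1))))
logdet_2 = np.sum(np.log(np.abs(np.diag(J2))))

# Determinants multiply under composition, so log-determinants add
_, logdet_total = np.linalg.slogdet(J2 @ J1)
print(np.isclose(logdet_total, logdet_1 + logdet_2))   # True
```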
When you see a product like $ABC\mathbf{x}$, read it right-to-left to understand the sequence of operations:

$$ABC\mathbf{x} = A(B(C\mathbf{x}))$$

1. First, $C$ transforms $\mathbf{x}$
2. Then, $B$ transforms the result
3. Finally, $A$ transforms that result

This is because function composition works inside-out, and matrix products are designed to match this.
Consider three transformations: a scaling by 2 ($S$), a 45° rotation ($R$), and a reflection across the x-axis ($F$):

$$S = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}, \quad R = \begin{bmatrix} 0.707 & -0.707 \\ 0.707 & 0.707 \end{bmatrix}, \quad F = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

The combined transformation is $FRS$.

Apply it to the vector $(1, 0)$:
1. $S(1,0) = (2, 0)$
2. $R(2, 0) = (1.414, 1.414)$
3. $F(1.414, 1.414) = (1.414, -1.414)$

Reading the product $FRS$ right-to-left matches the operation order: scale first ($S$), rotate second ($R$), reflect last ($F$). The matrix $FRS$ encapsulates all three steps in one multiplication.
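The same computation in NumPy, as a small sketch (the 0.707 entries are written exactly as $\sqrt{2}/2$):

```python
import numpy as np

S = np.array([[2.0, 0.0], [0.0, 2.0]])     # scale by 2
c = np.sqrt(2) / 2                          # cos(45°) = sin(45°) ≈ 0.707
R = np.array([[c, -c], [c, c]])             # rotate 45° counterclockwise
F = np.array([[1.0, 0.0], [0.0, -1.0]])     # reflect across the x-axis

x = np.array([1.0, 0.0])

# Step by step, right to left
print(S @ x)               # [2. 0.]
print(R @ (S @ x))         # ≈ [1.414 1.414]
print(F @ (R @ (S @ x)))   # ≈ [ 1.414 -1.414]

# The single combined matrix gives the same result
print((F @ R @ S) @ x)     # ≈ [ 1.414 -1.414]
```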
Some fields (especially computer graphics) use row vectors instead of column vectors, writing $\mathbf{x}^T A$ instead of $A\mathbf{x}$. In that convention, transformation order is left-to-right. Always check which convention is being used. In ML and most mathematical texts, column vectors are standard, so right-to-left applies.
Matrix products have another profound interpretation: change of basis. When you transform a matrix by conjugation ($P^{-1}AP$), you're viewing the same transformation $A$ from a different coordinate system.

The setup:

- $A$ is a transformation expressed in the standard basis
- $P$ is a matrix whose columns are a new basis $\{\mathbf{p}_1, \mathbf{p}_2, ...\}$
- $P^{-1}AP$ is the same transformation expressed in the new basis

Why this matters:

Some transformations look complicated in one basis but simple in another. The goal of diagonalization (finding eigenvalues) is to find a basis where the transformation becomes pure scaling, i.e., a diagonal matrix.
Step-by-step interpretation of $\mathbf{y} = P^{-1}AP\mathbf{x}$, reading right-to-left:

1. $P\mathbf{x}$: takes coordinates in the $P$-basis and converts them to standard coordinates
2. $A(P\mathbf{x})$: applies the transformation in the standard basis
3. $P^{-1}(AP\mathbf{x})$: converts the result back to $P$-basis coordinates

The beautiful result: if $P$ consists of eigenvectors of $A$, then $P^{-1}AP = D$, where $D$ is diagonal with the eigenvalues on its diagonal. The transformation, in its natural basis, is just scaling.
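A concrete check with a small symmetric matrix, as a minimal sketch (the matrix is an arbitrary diagonalizable example; the eigenvalue order returned by NumPy may vary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # an arbitrary diagonalizable example

eigvals, P = np.linalg.eig(A)        # columns of P are eigenvectors of A
D = np.linalg.inv(P) @ A @ P         # same transformation, eigenvector coordinates

print(np.round(D, 10))               # diagonal, with eigenvalues 3 and 1 (order may vary)
print(eigvals)
```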
Principal Component Analysis finds a basis (the principal components) where the covariance matrix becomes diagonal. In this new coordinate system, features are uncorrelated and ordered by variance. The 'principal components' ARE the eigenvectors of the covariance matrix. PCA literally changes to the natural basis of your data's spread.
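To see this in code, the sketch below generates toy correlated data, diagonalizes its covariance with an eigendecomposition, and confirms the projected features are uncorrelated (the data, mixing matrix, and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy correlated 2-D data (a stand-in for a real dataset)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)          # covariance in the original basis
eigvals, P = np.linalg.eigh(C)       # columns of P = principal components

# Express the data in the eigenvector basis: covariance becomes diagonal
Z = X @ P
print(np.round(np.cov(Z, rowvar=False), 6))   # ~diagonal, entries = eigvals
print(np.round(eigvals, 6))
```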
$$A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

This shears horizontally. What are its eigenvalues?

Eigenvalues: both equal 1 (a repeated eigenvalue). But the repeated eigenvalue has only a one-dimensional eigenspace (spanned by $(1, 0)$), so there is no basis of eigenvectors and $A$ is NOT diagonalizable. Its Jordan form is $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ (it's already in Jordan canonical form!).

Not every matrix can be diagonalized; some transformations have intrinsic 'shear-like' behavior that no basis change can remove. Jordan form is the closest we can get for non-diagonalizable matrices.
The forward pass of a feedforward neural network is fundamentally a sequence of matrix multiplications interleaved with nonlinear activations:

$$\mathbf{h}_1 = \sigma(W_1 \mathbf{x} + \mathbf{b}_1)$$
$$\mathbf{h}_2 = \sigma(W_2 \mathbf{h}_1 + \mathbf{b}_2)$$
$$\vdots$$
$$\mathbf{y} = W_L \mathbf{h}_{L-1} + \mathbf{b}_L$$

The linear part (ignoring biases and activations momentarily):

$$\mathbf{y} = W_L W_{L-1} \cdots W_2 W_1 \mathbf{x}$$

This is a product of many matrices! Each weight matrix transforms the representation, changing dimension, rotating, scaling, and projecting.
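A minimal sketch of such a forward pass in NumPy (the layer sizes 784 → 256 → 128 → 10 and the random weights are illustrative assumptions, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical layer sizes: 784 -> 256 -> 128 -> 10
W1, b1 = 0.01 * rng.standard_normal((256, 784)), np.zeros(256)
W2, b2 = 0.01 * rng.standard_normal((128, 256)), np.zeros(128)
W3, b3 = 0.01 * rng.standard_normal((10, 128)), np.zeros(10)

x = rng.standard_normal(784)

# Forward pass: matrix multiplications interleaved with nonlinearities
h1 = relu(W1 @ x + b1)
h2 = relu(W2 @ h1 + b2)
y = W3 @ h2 + b3
print(y.shape)    # (10,)
```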
A beautifully geometric view: each layer of a neural network folds, stretches, and warps the input space. The goal is to find a sequence of transformations that 'unfolds' the data into a configuration where classes are linearly separable. The network literally learns to warp space until the problem becomes easy.
Why depth matters (with activations):

Without activations, $W_L \cdots W_1$ collapses to a single matrix. With activations, each layer can create folds and bends that a single linear transform cannot.

Consider classifying two interleaved spirals. No linear transformation can separate them. But a deep network can:
1. First layers: detect local curve orientations
2. Middle layers: identify spiral arm membership
3. Final layers: map arms to class labels

Each matrix transforms the representation, and each nonlinearity enables new folding. The matrix product chain, punctuated by nonlinearities, is the essence of deep learning's power.
Matrix multiplication is associative: $(AB)C = A(BC)$. Both give the same result, but computational costs can differ dramatically. This matters enormously in ML, where we often multiply chains of matrices with different shapes.
$A$: $10 \times 100$
$B$: $100 \times 5$
$C$: $5 \times 50$
$D$: $50 \times 1$

Different orderings have vastly different costs (multiplying a $p \times q$ matrix by a $q \times r$ matrix costs about $p \cdot q \cdot r$ operations):

- $((AB)C)D$: $10 \cdot 100 \cdot 5 + 10 \cdot 5 \cdot 50 + 10 \cdot 50 \cdot 1 = 5000 + 2500 + 500 = 8000$ ops
- $(A(BC))D$: $100 \cdot 5 \cdot 50 + 10 \cdot 100 \cdot 50 + 10 \cdot 50 \cdot 1 = 25000 + 50000 + 500 = 75500$ ops
- $A(B(CD))$: $5 \cdot 50 \cdot 1 + 100 \cdot 5 \cdot 1 + 10 \cdot 100 \cdot 1 = 250 + 500 + 1000 = 1750$ ops

The best order, $A(B(CD))$, is over 40× faster than the worst! Dynamic programming finds the optimal parenthesization in $O(n^3)$ time for $n$ matrices. Deep learning frameworks do this automatically when possible.
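The sketch below checks that all three orderings produce the same matrix and recomputes the operation counts (using the standard rough cost model of $p \cdot q \cdot r$ operations per dense multiplication):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((10, 100))
B = rng.standard_normal((100, 5))
C = rng.standard_normal((5, 50))
D = rng.standard_normal((50, 1))

# Associativity: every parenthesization gives the same matrix...
P1 = ((A @ B) @ C) @ D
P2 = (A @ (B @ C)) @ D
P3 = A @ (B @ (C @ D))
print(np.allclose(P1, P2) and np.allclose(P2, P3))   # True

# ...but not at the same cost: (p x q) times (q x r) costs ~p*q*r operations
def cost(p, q, r):
    return p * q * r

print(cost(10, 100, 5) + cost(10, 5, 50) + cost(10, 50, 1))    # 8000
print(cost(100, 5, 50) + cost(10, 100, 50) + cost(10, 50, 1))  # 75500
print(cost(5, 50, 1) + cost(100, 5, 1) + cost(10, 100, 1))     # 1750
```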
When computing a chain ending with a vector ($M_n...M_2 M_1 \mathbf{x}$), always multiply right-to-left: $M_n(...(M_2(M_1 \mathbf{x})))$. Each intermediate result is a vector, keeping dimensions small. Never form the full matrix product first—that's almost always slower.
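A rough timing sketch of the difference (exact numbers depend on your hardware and BLAS library; the $1000 \times 1000$ sizes are arbitrary):

```python
import time
import numpy as np

rng = np.random.default_rng(7)
M1 = rng.standard_normal((1000, 1000))
M2 = rng.standard_normal((1000, 1000))
x = rng.standard_normal(1000)

# Right-to-left: two matrix-vector products, ~2 million multiply-adds each
t0 = time.perf_counter()
y_fast = M2 @ (M1 @ x)
t_fast = time.perf_counter() - t0

# Forming the full matrix product first: ~1 billion multiply-adds
t0 = time.perf_counter()
y_slow = (M2 @ M1) @ x
t_slow = time.perf_counter() - t0

print(np.allclose(y_fast, y_slow))                          # True
print(f"right-to-left: {t_fast:.4f}s, matrix-first: {t_slow:.4f}s")
```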
Certain matrix types have special multiplication properties that simplify computation and reveal structure.
Diagonal matrices:

$$D = \begin{bmatrix} d_1 & 0 & \cdots \\ 0 & d_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

Multiplication is easy: $DA$ scales the rows of $A$ by the diagonal entries; $AD$ scales the columns of $A$.

Product of diagonals: $D_1 D_2$ is diagonal with entries $d_{1i} \cdot d_{2i}$. Just multiply corresponding diagonal entries!

Inverse: $D^{-1}$ has entries $1/d_i$, trivial to compute if no diagonal entry is zero.

ML use: batch normalization scaling, attention score weighting, per-feature learning rates.
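These properties are easy to verify numerically; here is a minimal sketch with arbitrary entries:

```python
import numpy as np

d = np.array([2.0, 3.0, 0.5])
D = np.diag(d)
A = np.arange(9, dtype=float).reshape(3, 3)

# D @ A scales the rows of A; A @ D scales its columns
print(np.allclose(D @ A, d[:, None] * A))    # True
print(np.allclose(A @ D, A * d[None, :]))    # True

# Product of diagonal matrices: multiply matching diagonal entries
D2 = np.diag([1.0, -1.0, 4.0])
print(np.allclose(D @ D2, np.diag(d * np.diag(D2))))     # True

# Inverse of a diagonal matrix: reciprocal of each entry (if none are zero)
print(np.allclose(np.linalg.inv(D), np.diag(1.0 / d)))   # True
```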
We've developed deep intuition for interpreting matrix multiplication as transformation composition: products are pipelines of transformations, determinants multiply, rank can only shrink, and a change of basis can reveal a transformation's simplest form.
What's next:

We've mastered how matrices transform space and compose. The next page explores the critical concepts of rank and nullity: what they tell us about a transformation's dimension-collapsing behavior, and why this matters for understanding when linear systems have solutions and how ML models can fail.
You now interpret matrix products geometrically and understand their role in ML systems. Matrix multiplication isn't just arithmetic—it's the algebra of transformation pipelines. Next: rank, nullity, and the fundamental theorem of linear algebra.