Having established that matrices represent linear transformations, we now turn to the operations we can perform on them. These aren't arbitrary rules invented for computational convenience—they're the natural algebraic expressions of geometric operations.
The key principle: Every matrix operation has a geometric meaning. Addition blends transformations. Scalar multiplication strengthens or weakens them. Matrix multiplication composes them. The transpose reflects across a symmetry axis. Understanding these connections makes matrix algebra feel inevitable rather than arbitrary.
By the end of this page, you will perform matrix operations fluently while understanding their geometric significance. You'll see why these operations are defined the way they are, and how they combine to express complex transformations concisely.
Matrix addition is the simplest operation: add corresponding elements.
$$A + B = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{bmatrix}$$
Requirement: Matrices must have the same dimensions. You cannot add a 2×3 matrix to a 3×2 matrix.
Geometric interpretation:
If $A$ sends the basis vector $\mathbf{e}_1$ to column 1 of $A$, and $B$ sends $\mathbf{e}_1$ to column 1 of $B$, then $(A+B)$ sends $\mathbf{e}_1$ to the vector sum of those destinations. It's like averaging or blending two transformations.
$$R = \begin{bmatrix} 0.707 & -0.707 \\ 0.707 & 0.707 \end{bmatrix}, \quad S = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$

$$R + S = \begin{bmatrix} 2.707 & -0.707 \\ 0.707 & 2.707 \end{bmatrix}$$

This sum doesn't represent 'rotate then scale' or 'scale then rotate'—those would be matrix products. Instead, it creates a new transformation that blends properties of both. The result scales diagonally while adding a rotational shear.
Matrix addition is most meaningful when matrices represent similar types of operations. Adding a rotation to a scaling gives a valid matrix but may not have an intuitive geometric interpretation. Addition is natural for: combining effects (like forces), averaging transformations, or creating linear interpolations between states.
Properties of matrix addition:
| Property | Statement | Meaning |
|---|---|---|
| Commutative | $A + B = B + A$ | Order doesn't matter |
| Associative | $(A + B) + C = A + (B + C)$ | Grouping doesn't matter |
| Additive Identity | $A + O = A$ | Zero matrix is neutral |
| Additive Inverse | $A + (-A) = O$ | Every matrix has a negation |
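These properties are easy to confirm numerically. Here is a minimal NumPy sketch (the matrix values are illustrative):

```python
import numpy as np

# Two 2x2 matrices; the values are arbitrary examples.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Addition is element-wise: (A + B)[i, j] = A[i, j] + B[i, j]
C = A + B
print(C)

# Commutativity and the additive inverse from the table above
assert np.array_equal(A + B, B + A)
assert np.array_equal(A + (-A), np.zeros((2, 2)))
```

Shapes must match exactly: attempting `A + np.zeros((3, 2))` here would raise a broadcasting error, mirroring the dimension requirement above.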
Scalar multiplication multiplies every entry by a constant:
$$cA = \begin{bmatrix} ca_{11} & ca_{12} \\ ca_{21} & ca_{22} \end{bmatrix}$$
Geometric interpretation:
If $A$ is a transformation, then $cA$ applies the same transformation but scales the entire output by $c$. This is equivalent to first applying $A$, then uniformly scaling by $c$.
| Scalar $c$ | Effect on Transformation | Example Use |
|---|---|---|
| $c > 1$ | Amplify the transformation | Stronger effect |
| $0 < c < 1$ | Dampen the transformation | Partial application |
| $c = 0$ | Collapse to zero matrix | Nullify transform |
| $c = -1$ | Reverse the transformation | Opposite direction |
| $c < -1$ | Reverse and amplify | Strong opposite effect |
Scalar multiplication and rotation:
Consider a 90° rotation matrix: $$R_{90} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$
Then $2R_{90}$: $$2R_{90} = \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix}$$
This rotates by 90° AND scales by factor 2. The basis vectors $\mathbf{e}_1$ and $\mathbf{e}_2$ move to $(0, 2)$ and $(-2, 0)$ respectively—rotated AND stretched.
In gradient descent, the learning rate acts like a scalar multiplier on the gradient matrix. A learning rate of 0.01 means 'apply 1% of the suggested update direction.' Too large (>1) and updates overshoot; too small and learning stalls. Scalar multiplication literally controls learning speed.
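A single gradient step can be sketched as scalar multiplication of the gradient. This is a toy example, not a real training loop; the weight and gradient values are assumed for illustration:

```python
import numpy as np

# Hypothetical weight matrix and gradient for one descent step.
W = np.array([[0.5, -0.2],
              [0.1,  0.8]])
grad = np.array([[1.0, 0.0],
                 [0.0, 1.0]])

lr = 0.01  # learning rate: a scalar multiplier on the update

# Scalar multiplication scales every entry of the update uniformly:
W_new = W - lr * grad
print(W_new)
```

With `lr = 0.01`, each diagonal entry moves by exactly 1% of the gradient's suggestion, matching the 'partial application' row in the table above.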
The transpose of matrix $A$, denoted $A^T$, is obtained by swapping rows and columns:
$$(A^T)_{ij} = A_{ji}$$
For a 2×3 matrix, the transpose is 3×2:
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \implies A^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$
Geometric interpretation:
For square matrices, the transpose reflects the matrix's entries across its main diagonal. Geometrically, $A^T$ is characterized by the identity $(A\mathbf{x}) \cdot \mathbf{y} = \mathbf{x} \cdot (A^T\mathbf{y})$ for all vectors: it transfers the action of $A$ from one side of a dot product to the other.
For rectangular matrices, transpose swaps the domain and codomain dimensions—a $3 \times 5$ matrix (maps $\mathbb{R}^5 \to \mathbb{R}^3$) transposes to a $5 \times 3$ matrix (maps $\mathbb{R}^3 \to \mathbb{R}^5$).
The transpose of a product reverses the order: $(AB)^T = B^T A^T$. This is crucial and often forgotten. Geometrically: if we transpose the composition of two transformations, we must compose the transposes in reverse order. This property appears constantly in gradient derivations for neural networks.
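The order reversal is worth verifying by hand at least once. A quick NumPy check (matrix values are illustrative):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# (AB)^T equals B^T A^T -- note the reversed order
lhs = (A @ B).T
rhs = B.T @ A.T
assert np.array_equal(lhs, rhs)

# The un-reversed order generally differs:
print(np.array_equal(lhs, A.T @ B.T))  # False for these matrices
```

Getting this order wrong is one of the most common shape bugs in hand-derived backpropagation code.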
Special matrix types defined by transpose:
| Type | Definition | Geometric Meaning |
|---|---|---|
| Symmetric | $A = A^T$ | Transformation has perpendicular eigenvectors |
| Skew-symmetric | $A = -A^T$ | Infinitesimal rotation (in 3D: generator of rotations) |
| Orthogonal | $A^T = A^{-1}$ | Preserves lengths and angles |
Matrix-vector multiplication is the fundamental operation that applies a transformation to a vector. There are two equivalent ways to interpret it, and understanding both provides deeper insight.
Row-wise interpretation (dot products):
Each component of the output is the dot product of a row of $A$ with the input vector:
$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{bmatrix}$$
Output component $y_i = (\text{row } i) \cdot \mathbf{x}$
Geometric meaning: Each output component measures how much the input aligns with a particular direction defined by the row.
ML example: In a neural network layer, each neuron computes a weighted sum of inputs—that's a dot product. The row of the weight matrix contains that neuron's learned sensitivities to each input feature.
$$A = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$$

**Row picture:** $y_1 = 2(4) + 1(2) = 10$, $y_2 = 0(4) + 3(2) = 6$ → $(10, 6)$

**Column picture:** $4\begin{bmatrix}2\\0\end{bmatrix} + 2\begin{bmatrix}1\\3\end{bmatrix} = \begin{bmatrix}8\\0\end{bmatrix} + \begin{bmatrix}2\\6\end{bmatrix} = \begin{bmatrix}10\\6\end{bmatrix}$

Same answer, different perspectives. Row picture: compute dot products. Column picture: combine columns. Both are essential for ML intuition.
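Both pictures are straightforward to express in NumPy, using the same $A$ and $\mathbf{x}$ as the worked example:

```python
import numpy as np

A = np.array([[2, 1],
              [0, 3]])
x = np.array([4, 2])

# Row picture: each output component is a row of A dotted with x
row_picture = np.array([A[0] @ x, A[1] @ x])

# Column picture: the output is a weighted sum of A's columns
col_picture = x[0] * A[:, 0] + x[1] * A[:, 1]

print(row_picture)  # [10  6]
assert np.array_equal(row_picture, col_picture)
assert np.array_equal(A @ x, row_picture)
```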
In neural networks, think of the weight matrix columns as learned feature detectors. The input vector says 'how much of each feature is present.' The output is the combination of feature responses. This column view makes layer operations intuitive: you're combining learned representations.
Matrix multiplication is where the transformation perspective truly shines. The product $AB$ represents applying transformation $B$ first, then transformation $A$.
Definition:
For $A$ (size $m \times n$) and $B$ (size $n \times p$), their product $C = AB$ has size $m \times p$:
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} = (\text{row } i \text{ of } A) \cdot (\text{column } j \text{ of } B)$$
Dimension rule: Inner dimensions must match. $(m \times \mathbf{n}) \cdot (\mathbf{n} \times p) = (m \times p)$
If $A$ is $2 \times 3$ and $B$ is $3 \times 4$, then $AB$ is $2 \times 4$. But $BA$ is undefined (4 ≠ 2).
Why this definition?
We want the matrix product to behave exactly like function composition: applying $B$ first, then $A$, gives $\mathbf{y} = A(B\mathbf{x})$.
The product $AB$ should give the same result in one step: $\mathbf{y} = (AB)\mathbf{x}$
The row-by-column multiplication rule is the ONLY definition that enforces $(AB)\mathbf{x} = A(B\mathbf{x})$ for every vector $\mathbf{x}$.
$$S = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \text{ (scale x by 2)}$$

$$R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \text{ (rotate 90°)}$$

$$RS = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 2 & 0 \end{bmatrix}$$

Reading right-to-left: first scale x by 2, then rotate 90°. The basis vector (1,0) goes to (2,0) after scaling, then to (0,2) after rotation—which is exactly column 1 of the product. Verify: column 2 should take (0,1) → (0,1) → (-1,0). ✓
In general, $AB \neq BA$. 'Rotate then scale' differs from 'scale then rotate.' Even when both products are defined (square matrices of same size), the results typically differ. This reflects the physical reality that the order of operations matters in geometry.
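Both claims, composition and non-commutativity, can be checked with the scale and rotation matrices from the example above:

```python
import numpy as np

S = np.array([[2, 0],
              [0, 1]])   # scale x by 2
R = np.array([[0, -1],
              [1,  0]])  # rotate 90 degrees

x = np.array([1, 0])

# Composition: one combined step equals two sequential steps
combined = (R @ S) @ x
stepwise = R @ (S @ x)
print(combined)                      # [0 2]
assert np.array_equal(combined, stepwise)

# Order matters: scale-then-rotate differs from rotate-then-scale
print(R @ S)
print(S @ R)
assert not np.array_equal(R @ S, S @ R)
```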
Properties of matrix multiplication:
| Property | Statement | Note |
|---|---|---|
| NOT commutative | $AB \neq BA$ generally | Order matters! |
| Associative | $(AB)C = A(BC)$ | Grouping doesn't affect result |
| Distributive | $A(B+C) = AB + AC$ | Multiplication distributes over addition |
| Identity | $AI = IA = A$ | Identity matrix is neutral |
| Zero | $A \cdot O = O$ | Zero matrix annihilates |
Understanding matrix multiplication deeply requires seeing it from multiple angles. Here are four complementary perspectives:
1. Entry-by-entry (dot product view):
Each entry $(i,j)$ of $C = AB$ is the dot product of row $i$ of $A$ with column $j$ of $B$:
$$C_{ij} = \mathbf{a}_i^T \mathbf{b}_j = \sum_k A_{ik} B_{kj}$$
When useful: Computing individual entries, understanding computational complexity ($O(n^3)$ for $n \times n$ matrices).
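The entry-by-entry view translates directly into the classic triple loop. This naive sketch makes the $O(mnp)$ cost visible (in practice you would always use `A @ B`, which calls optimized BLAS routines):

```python
import numpy as np

def matmul_naive(A, B):
    """Entry-by-entry product: C[i, j] = (row i of A) dot (column j of B)."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p))
    for i in range(m):          # m * p entries, n multiply-adds each: O(mnp)
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
assert np.allclose(matmul_naive(A, B), A @ B)
```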
In gradient descent, weight updates have the form $\Delta W = \eta \cdot \mathbf{error} \cdot \mathbf{input}^T$—an outer product! Each training example contributes a rank-1 update. Understanding this view clarifies how neural networks learn: they accumulate evidence from input-error correlations.
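The rank-1 structure of such an update is easy to see in code. A minimal sketch, with assumed shapes (3 output units, 4 input features) and an assumed learning rate:

```python
import numpy as np

# Illustrative values: error signal per output unit, input feature vector.
error = np.array([0.5, -0.2, 0.1])
inputs = np.array([1.0, 0.0, 2.0, -1.0])
eta = 0.1  # assumed learning rate

# Outer product: a 3x4 matrix pairing every error with every input
delta_W = eta * np.outer(error, inputs)
print(delta_W.shape)  # (3, 4)

# One training example contributes exactly a rank-1 update
assert np.linalg.matrix_rank(delta_W) == 1
```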
Understanding computational cost is essential for ML practitioners who work with matrices containing millions of parameters.
| Operation | Time Complexity | Memory | Notes |
|---|---|---|---|
| Addition $A + B$ ($m \times n$) | $O(mn)$ | $O(mn)$ | Element-wise, highly parallelizable |
| Scalar mult $cA$ | $O(mn)$ | $O(mn)$ | Element-wise |
| Transpose $A^T$ | $O(mn)$ | $O(mn)$ | In practice, often $O(1)$ via view |
| Matrix-vector $A\mathbf{x}$ ($m \times n$) | $O(mn)$ | $O(m + n)$ | Each output needs $n$ ops |
| Matrix-matrix $AB$ ($m \times n \times p$) | $O(mnp)$ | $O(mp)$ | Naive algorithm; faster exist |
| Square matrix mult ($n \times n$) | $O(n^3)$ | $O(n^2)$ | Strassen: $O(n^{2.807})$ |
The $n^3$ barrier:
Naive matrix multiplication of two $n \times n$ matrices requires $O(n^3)$ operations. For $n = 1000$, that's a billion operations. For $n = 10000$, it's a trillion.
Optimizations in practice:
Matrix multiplication is associative: $(AB)C = A(BC)$. But the order affects computation cost! Computing $ABC$ where $A$ is $10 \times 100$, $B$ is $100 \times 5$, $C$ is $5 \times 50$:
• $(AB)C$: $(10 \times 100 \times 5) + (10 \times 5 \times 50) = 5000 + 2500 = 7500$ ops
• $A(BC)$: $(100 \times 5 \times 50) + (10 \times 100 \times 50) = 25000 + 50000 = 75000$ ops
Choosing the right order matters—10x difference here!
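The cost comparison above can be reproduced with a one-line cost model, and associativity checked numerically on random matrices of those shapes:

```python
import numpy as np

A = np.random.rand(10, 100)
B = np.random.rand(100, 5)
C = np.random.rand(5, 50)

def cost(m, n, p):
    """Multiply-adds for an (m x n) times (n x p) product."""
    return m * n * p

left = cost(10, 100, 5) + cost(10, 5, 50)     # (AB)C
right = cost(100, 5, 50) + cost(10, 100, 50)  # A(BC)
print(left, right)  # 7500 75000

# Either order yields the same matrix (associativity), at very different cost:
assert np.allclose((A @ B) @ C, A @ (B @ C))
```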
We've covered the essential operations on matrices and their geometric meanings. Here's a consolidated reference:
| Operation | Formula | Requirement |
|---|---|---|
| Addition | $(A+B)_{ij} = A_{ij} + B_{ij}$ | Same dimensions |
| Scalar mult | $(cA)_{ij} = cA_{ij}$ | Any matrix |
| Transpose | $(A^T)_{ij} = A_{ji}$ | Any matrix |
| Product | $(AB)_{ij} = \sum_k A_{ik}B_{kj}$ | Cols of $A$ = Rows of $B$ |
What's next:
Now that we can compute with matrices, we turn to interpreting what matrix multiplication tells us about the relationship between input and output space. The next page explores the deep meaning of matrix multiplication as transformation composition, including how to visualize and reason about chained transformations in ML.
You now have computational fluency with matrix operations and understand their geometric significance. These operations aren't arbitrary—they're the algebra of transformations. Next: diving deeper into matrix multiplication interpretation and what it reveals about composed transformations.