Matrix multiplication is one of the most fundamental operations in all of computing. It powers everything from 3D graphics transformations and physics simulations to machine learning algorithms and signal processing. Modern neural networks execute trillions of matrix multiplications during training and inference. Scientific simulations in weather prediction, molecular dynamics, and economic modeling all rely heavily on this single operation.
Yet despite its ubiquity, the standard algorithm for multiplying two n×n matrices requires O(n³) arithmetic operations—a cost that becomes astronomical as matrix sizes grow. For a 1,000×1,000 matrix, that's one billion operations. For a 10,000×10,000 matrix, one trillion. Understanding why this cubic complexity arises, and whether we can do better, is one of the most important questions in computational mathematics.
By the end of this page, you will thoroughly understand the standard matrix multiplication algorithm, derive its O(n³) complexity from first principles, see why every entry requires n multiplications, and understand the computational bottleneck that Strassen's algorithm will later address. You'll also appreciate why matrix multiplication is distinct from element-wise operations.
Before diving into the algorithm, we must establish precise mathematical foundations. A matrix is a rectangular array of numbers organized into rows and columns. An m×n matrix A has m rows and n columns, and we denote the entry in row i and column j as A[i][j] or A_{i,j}.
Notation conventions:
- A[i][j] (or A_{i,j}) denotes the entry in row i, column j; indices run from 1 to the dimension, following the usual mathematical convention.
- When multiplying, A is m×n, B is n×p, and the product C = A × B is m×p.
For square matrices (which we'll focus on for complexity analysis), we have n×n matrices where m = n = p.
| Matrix A | Matrix B | Result C | Valid? |
|---|---|---|---|
| 3×4 matrix | 4×5 matrix | 3×5 matrix | Yes — inner dimensions match (4=4) |
| 3×4 matrix | 5×4 matrix | N/A | No — inner dimensions differ (4≠5) |
| n×n matrix | n×n matrix | n×n matrix | Yes — square matrices always compatible |
| m×n matrix | n×p matrix | m×p matrix | Yes — general compatible case |
As the table above illustrates, matrix multiplication A × B is only defined when the number of columns in A equals the number of rows in B. If A is m×n, then B must be n×p for some p. This isn't a programming constraint—it's a mathematical requirement arising from how matrix multiplication is defined.
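To make the compatibility rule concrete, here is a minimal sketch in Python (the helper name `can_multiply` is ours, not from any library) that checks whether two matrices, represented as lists of lists, can be multiplied:

```python
def can_multiply(A: list[list[float]], B: list[list[float]]) -> bool:
    """Return True if the product A × B is defined.

    The product exists exactly when A's column count (its inner
    dimension) equals B's row count.
    """
    cols_A = len(A[0])  # number of columns of A
    rows_B = len(B)     # number of rows of B
    return cols_A == rows_B

A = [[0.0] * 4 for _ in range(3)]   # 3×4 matrix
B = [[0.0] * 5 for _ in range(4)]   # 4×5 matrix
print(can_multiply(A, B))  # True  — inner dimensions match (4 = 4)
print(can_multiply(B, A))  # False — inner dimensions differ (5 ≠ 3)
```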
Memory layout considerations:
In computer memory, matrices are typically stored in one of two layouts:
- Row-major order: each row is stored contiguously (the convention in C, and the default in NumPy), so A[i][j] sits next to A[i][j+1] in memory.
- Column-major order: each column is stored contiguously (the convention in Fortran and MATLAB), so A[i][j] sits next to A[i+1][j] in memory.
This distinction profoundly affects cache performance. The standard algorithm's memory access patterns interact with cache hierarchy in ways that significantly impact real-world performance beyond the O(n³) theoretical bound.
Matrix multiplication is fundamentally different from element-wise multiplication. Each entry in the result matrix C is computed as a dot product of a row from A and a column from B.
Formal definition:
Given matrices A (m×n) and B (n×p), the product C = A × B is an m×p matrix where:
$$C[i][j] = \sum_{k=1}^{n} A[i][k] \cdot B[k][j]$$
In plain language: C[i][j] equals the sum of products of corresponding elements from the i-th row of A and the j-th column of B.
The dot product of two vectors [a₁, a₂, ..., aₙ] and [b₁, b₂, ..., bₙ] is a₁b₁ + a₂b₂ + ... + aₙbₙ. Matrix multiplication computes n² such dot products for n×n matrices—one for each position in the result matrix.
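As a quick illustration, here is the dot product on its own; matrix multiplication simply repeats this computation once per output entry:

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors: u1*v1 + u2*v2 + ... + un*vn."""
    return sum(u_k * v_k for u_k, v_k in zip(u, v))

# Row 1 of A = [1, 2] with column 1 of B = [5, 7] (from the example below):
print(dot([1, 2], [5, 7]))  # 19
```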
Concrete example with 2×2 matrices:
Let's multiply:
```
A = [1 2]    B = [5 6]
    [3 4]        [7 8]
```
The result C = A × B:
C[1][1] = A[1][1]·B[1][1] + A[1][2]·B[2][1] = 1·5 + 2·7 = 5 + 14 = 19
C[1][2] = A[1][1]·B[1][2] + A[1][2]·B[2][2] = 1·6 + 2·8 = 6 + 16 = 22
C[2][1] = A[2][1]·B[1][1] + A[2][2]·B[2][1] = 3·5 + 4·7 = 15 + 28 = 43
C[2][2] = A[2][1]·B[1][2] + A[2][2]·B[2][2] = 3·6 + 4·8 = 18 + 32 = 50
Therefore:
```
C = [19 22]
    [43 50]
```
Notice that computing each entry required exactly 2 multiplications and 1 addition (n multiplications and n-1 additions for n×n matrices).
The standard algorithm directly implements the mathematical definition. It uses three nested loops: one for each row of A, one for each column of B, and one for the summation over the shared dimension.
Algorithm:
```
function multiplyMatrices(A, B, n):
    // Initialize result matrix C with zeros
    C = new n×n matrix, all entries 0

    // For each row i of A
    for i = 1 to n:
        // For each column j of B
        for j = 1 to n:
            // Compute dot product of row i and column j
            sum = 0
            for k = 1 to n:
                sum = sum + A[i][k] * B[k][j]
            C[i][j] = sum

    return C
```
This algorithm is elegantly simple and directly mirrors the mathematical definition. Its correctness follows immediately from the formula C[i][j] = Σ A[i][k] · B[k][j].
```python
def matrix_multiply(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    """
    Standard O(n³) matrix multiplication.

    Args:
        A: First matrix (n×n)
        B: Second matrix (n×n)

    Returns:
        C: Product matrix A × B (n×n)
    """
    n = len(A)

    # Initialize result matrix with zeros
    C = [[0.0 for _ in range(n)] for _ in range(n)]

    # Triple nested loop - the signature of O(n³) complexity
    for i in range(n):              # Iterate over rows of A
        for j in range(n):          # Iterate over columns of B
            # Compute dot product of row i of A and column j of B
            dot_product = 0.0
            for k in range(n):      # Iterate over shared dimension
                dot_product += A[i][k] * B[k][j]
            C[i][j] = dot_product

    return C

# Example usage
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matrix_multiply(A, B)
print("Result:", C)  # [[19, 22], [43, 50]]
```

- The outer loop (i) selects which row of A we're working with
- The middle loop (j) selects which column of B we're targeting
- The innermost loop (k) accumulates the dot product across the shared dimension
This structure is fundamental—any algorithm that computes all n² entries of C must somehow account for n multiplications per entry.
Let's derive the time complexity with mathematical precision. For n×n matrices, we count the exact number of arithmetic operations.
Multiplication count:
Each of the n² entries C[i][j] is a dot product of length n, requiring n multiplications:
Total multiplications = n² × n = n³
Addition count:
Summing n products requires n − 1 additions per entry:
Total additions = n² × (n − 1) = n³ − n²
| n | Multiplications (n³) | Additions (n³-n²) | Total Operations |
|---|---|---|---|
| 2 | 8 | 4 | 12 |
| 10 | 1,000 | 900 | 1,900 |
| 100 | 1,000,000 | 990,000 | 1,990,000 |
| 1,000 | 10⁹ | ~10⁹ | ~2×10⁹ |
| 10,000 | 10¹² | ~10¹² | ~2×10¹² |
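These counts are easy to verify empirically. The instrumented sketch below (our own variant, not part of the earlier listing) tallies operations and matches the formulas above:

```python
def count_operations(n: int) -> tuple[int, int]:
    """Count multiplications and additions performed by the standard algorithm."""
    mults, adds = 0, 0
    for _ in range(n):          # rows of A
        for _ in range(n):      # columns of B
            for k in range(n):  # shared dimension
                mults += 1      # A[i][k] * B[k][j]
                if k > 0:
                    adds += 1   # accumulating into the running sum
    return mults, adds

for n in (2, 10, 100):
    print(n, count_operations(n))
# Matches n³ and n³ - n²: (2, (8, 4)), (10, (1000, 900)), (100, (1000000, 990000))
```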
Asymptotic analysis:
Total operations T(n) = n³ + (n³ - n²) = 2n³ - n²
For large n, the n³ term dominates:
T(n) = 2n³ - n² = Θ(n³)
Therefore, the standard algorithm's time complexity is O(n³)—and more precisely Θ(n³), since the algorithm performs exactly 2n³ − n² operations on every input, never fewer.
Space complexity:
The algorithm needs Θ(n²) space for the output matrix C, plus O(1) auxiliary space for the loop counters and running sum; the inputs are read but never modified.
Practical consequences of cubic scaling:
Doubling the matrix size (2n) increases time by 8× (since (2n)³ = 8n³). So a 1-second computation on 1,000×1,000 matrices becomes 8 seconds for 2,000×2,000 matrices, 64 seconds for 4,000×4,000, and over 8 minutes for 8,000×8,000. This cubic blow-up motivates the search for better algorithms.
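The arithmetic behind those figures is just the cubic ratio, as a two-line sketch makes explicit:

```python
base_n, base_seconds = 1_000, 1.0  # assumed baseline: 1 second at n = 1,000

for n in (2_000, 4_000, 8_000):
    # Time scales with the cube of the size ratio: (n / base_n)³
    print(n, base_seconds * (n / base_n) ** 3, "seconds")
# 2000 → 8.0, 4000 → 64.0, 8000 → 512.0 (about 8.5 minutes)
```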
At first glance, O(n³) seems like an unavoidable lower bound. The argument goes:
- The result C has n² entries, and every one of them must be computed.
- Each entry is a dot product of length n, which appears to require n multiplications.
- Therefore any algorithm seems to need n² × n = n³ multiplications.
This reasoning is correct for the naive approach but contains a hidden assumption: that each entry must be computed independently. Strassen's insight was that entries are not independent—they share common subexpressions that can be computed once and reused.
The open question of optimal matrix multiplication:
Remarkably, the true lower bound for matrix multiplication remains unknown. We know:
- A trivial lower bound of Ω(n²): any algorithm must at least read all 2n² inputs and write all n² outputs.
- Strassen's algorithm (1969) achieves O(n^2.807).
- A long line of refinements has pushed the best known exponent down to roughly 2.37, though the constant factors make those algorithms impractical for realistic matrix sizes.
The conjecture that matrix multiplication can be done in essentially O(n²) time (exponent ω = 2) remains one of the great open problems in theoretical computer science.
While all loop orderings (ijk, ikj, jik, jki, kij, kji) give the same mathematical result, their performance differs dramatically due to cache behavior.
Understanding the issue:
In our standard (ijk) order:
- A[i][k] is read along row i as k varies: consecutive memory locations (row-wise).
- B[k][j] is read down column j as k varies: strided memory locations (column-wise).
- C[i][j] is written once per (i, j) pair, row by row.
When accessing B column-wise in row-major storage, each access potentially causes a cache miss because consecutive elements of a column are separated by n positions in memory.
| Loop Order | A Access Pattern | B Access Pattern | C Access Pattern | Cache Behavior |
|---|---|---|---|---|
| ijk | Row-wise ✓ | Column-wise ✗ | Row-wise ✓ | Poor |
| ikj | Row-wise ✓ | Row-wise ✓ | Row-wise ✓ | Good |
| jik | Column-wise ✗ | Column-wise ✗ | Column-wise ✗ | Very Poor |
| jki | Column-wise ✗ | Row-wise ✓ | Column-wise ✗ | Poor |
| kij | Row-wise ✓ | Row-wise ✓ | Row-wise ✓ | Good |
| kji | Column-wise ✗ | Row-wise ✓ | Column-wise ✗ | Poor |
```python
import time

def matrix_multiply_ikj(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    """
    Cache-friendly matrix multiplication using ikj loop order.

    Key insight: In ikj order, we access both matrices row-wise, which is
    cache-friendly for row-major storage (like Python lists). Instead of
    computing one complete C[i][j] at a time, we accumulate partial results
    across all C[i][*] simultaneously.
    """
    n = len(A)
    C = [[0.0 for _ in range(n)] for _ in range(n)]

    for i in range(n):          # For each row of A
        for k in range(n):      # For each column of A / row of B
            # A[i][k] is loaded once for the entire inner loop
            a_ik = A[i][k]
            for j in range(n):  # For each column of B
                # B[k][j] accessed row-wise (cache-friendly!)
                # C[i][j] accessed row-wise (cache-friendly!)
                C[i][j] += a_ik * B[k][j]

    return C

# Performance comparison
def benchmark(multiply_func, n=500, name=""):
    A = [[float(i*n + j) for j in range(n)] for i in range(n)]
    B = [[float(i*n + j) for j in range(n)] for i in range(n)]
    start = time.time()
    C = multiply_func(A, B)
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f} seconds")

# Compare the two orderings (matrix_multiply is defined earlier on this page)
benchmark(matrix_multiply, name="ijk (naive)")
benchmark(matrix_multiply_ikj, name="ikj (cache-friendly)")
```

On modern processors, the cache-friendly ikj order can be 3-10× faster than the cache-unfriendly ijk order for large matrices. This is despite having identical O(n³) complexity—a powerful reminder that asymptotic analysis doesn't tell the whole story.
To further improve cache performance, we can use blocking (also called tiling). The idea is to partition matrices into smaller blocks that fit entirely in cache, then perform multiplication at the block level.
The blocking strategy:
- Partition each n×n matrix into (n/b)² blocks of size b×b.
- Treat the matrices as (n/b)×(n/b) block matrices and multiply block by block: C_ij = Σ_k A_ik × B_kj, where each term is itself a b×b matrix product.
- While one block product is being computed, only three b×b blocks need to be resident in cache.
The key insight: if b is chosen so blocks fit in cache (typically L2 cache), we maximize data reuse and minimize cache misses.
```python
def matrix_multiply_blocked(A: list[list[float]],
                            B: list[list[float]],
                            block_size: int = 64) -> list[list[float]]:
    """
    Blocked (tiled) matrix multiplication for improved cache efficiency.

    Args:
        A, B: Input n×n matrices
        block_size: Size of blocks (should fit in L2 cache)

    The algorithm partitions matrices into blocks and performs
    multiplication at the block level:
        C_ij = Σ_k A_ik × B_kj   (where subscripts denote blocks)
    """
    n = len(A)
    C = [[0.0 for _ in range(n)] for _ in range(n)]

    # Iterate over block indices
    for i0 in range(0, n, block_size):          # Block row of C/A
        for j0 in range(0, n, block_size):      # Block column of C/B
            for k0 in range(0, n, block_size):  # Block in shared dimension
                # Multiply block A[i0:i0+b, k0:k0+b] × B[k0:k0+b, j0:j0+b]
                # and accumulate into C[i0:i0+b, j0:j0+b]

                # Determine actual block boundaries (handle edge cases)
                i_end = min(i0 + block_size, n)
                j_end = min(j0 + block_size, n)
                k_end = min(k0 + block_size, n)

                # Standard multiplication within blocks
                for i in range(i0, i_end):
                    for j in range(j0, j_end):
                        for k in range(k0, k_end):
                            C[i][j] += A[i][k] * B[k][j]

    return C

# The block size should be tuned for the specific cache size.
# Typical values: 32-128 for L1, 64-512 for L2
```

Complexity analysis of blocked multiplication:
The asymptotic complexity remains O(n³), but the constant factor improves dramatically due to cache efficiency. For large matrices, blocked multiplication can be 10-50× faster than the naive approach.
Optimal block size selection:
Three blocks (one each from A, B, and C) must fit in cache simultaneously, so we want 3b² × (bytes per element) ≤ cache size, i.e. b ≈ √(cache size / (3 × element size)). For example, with a 256 KB L2 cache and 8-byte doubles, b ≈ √(262144 / 24) ≈ 104, so a power-of-two value like 64 or 128 is a sensible starting point; the best value is found by benchmarking on the target machine.
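That back-of-the-envelope rule is easy to encode (the cache size here is an assumed parameter, not something Python can detect portably):

```python
import math

def suggest_block_size(cache_bytes: int, element_bytes: int = 8) -> int:
    """Largest power of two b such that three b×b blocks fit in cache."""
    b_max = math.isqrt(cache_bytes // (3 * element_bytes))
    b = 1
    while b * 2 <= b_max:
        b *= 2
    return b

print(suggest_block_size(256 * 1024))  # 64 for a 256 KB cache and doubles
```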
Matrix multiplication has distinctive algebraic properties that differ significantly from scalar multiplication. Understanding these properties is crucial for designing algorithms and avoiding bugs:
- Associative: (AB)C = A(BC).
- Distributive over addition: A(B + C) = AB + AC and (A + B)C = AC + BC.
- Not commutative: in general, AB ≠ BA.
- Identity: AI = IA = A, where I is the identity matrix.
The fact that AB ≠ BA in general has profound implications. When computing ABCD, the groupings (AB)(CD), A((BC)D), and ((AB)C)D all give the same result (by associativity), but reordering the factors (computing BA instead of AB) generally gives a different result entirely. Divide-and-conquer algorithms exploit associativity while respecting non-commutativity.
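Non-commutativity is easy to see concretely, reusing the `matrix_multiply` function defined earlier:

```python
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(matrix_multiply(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
print(matrix_multiply(B, A))  # [[23.0, 34.0], [31.0, 46.0]] — a different matrix
```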
Matrix multiplication pervades modern computing. Understanding where O(n³) becomes problematic motivates the search for better algorithms.
| Domain | Typical Matrix Sizes | Operation Frequency | Performance Impact |
|---|---|---|---|
| Machine Learning (Training) | 10,000 × 10,000+ | Billions per epoch | Days/weeks of training time |
| Computer Graphics (3D) | 4 × 4 (transforms) | Millions per frame | 60 FPS requirement |
| Scientific Simulation | 100,000+ × 100,000+ | Thousands per timestep | Weeks of simulation |
| Signal Processing | Varies (FFT as matrix) | Real-time streaming | Latency constraints |
| Cryptography | Large sparse matrices | Per operation | Security/speed tradeoff |
| Recommendation Systems | Users × Items | Continuous updates | Latency for predictions |
Case study: Deep Learning
Modern neural networks are essentially matrix multiplication machines. A single forward pass through a transformer model like GPT involves:
- projecting every token's embedding into query, key, and value vectors (three matrix multiplications per attention layer),
- computing attention scores and their weighted sums (two more matrix products per layer),
- and two large feed-forward matrix multiplications per layer, repeated across dozens of layers.
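To get a feel for the scale, here is a rough FLOP estimate for a single dense layer; the dimensions below are illustrative assumptions, not the actual sizes of any particular model:

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """FLOPs for an m×n by n×p product: m·p entries, each needing
    n multiplications and n-1 additions, i.e. roughly 2·m·n·p total."""
    return 2 * m * n * p

# Hypothetical feed-forward layer: a batch of 2,048 tokens,
# hidden size 12,288, expanded to 4× the hidden size
print(f"{matmul_flops(2048, 12288, 4 * 12288):.3e} FLOPs")  # 2.474e+12
```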
For GPT-3 with 175 billion parameters, training required approximately 10²³ floating-point operations—most of which were matrix multiplications. Even a 10% improvement in matrix multiplication efficiency translates to massive energy and cost savings.
Training large language models costs millions of dollars in compute. If Strassen's algorithm (or better) can reduce matrix multiplication costs by even 10-20% in practice, that translates to hundreds of thousands of dollars saved per training run—plus significant environmental impact from reduced energy consumption.
We've established a thorough understanding of standard matrix multiplication—the computational workhorse that Strassen's algorithm aims to improve.
What's next:
With the standard algorithm firmly understood, we're ready to explore Strassen's revolutionary insight. In 1969, Volker Strassen discovered that by cleverly combining matrix entries, he could reduce the number of multiplications needed for 2×2 matrix multiplication from 8 to 7. This seemingly minor improvement—saving just one multiplication—has cascading effects when applied recursively, yielding an asymptotic complexity of O(n^2.807).
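The exponent comes from the standard divide-and-conquer recurrence: with 7 recursive half-size multiplications instead of 8, plus quadratic work for the matrix additions,

$$T(n) = 7\,T(n/2) + O(n^2) \;\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.807}\right)$$

whereas 8 subproblems would give T(n) = 8T(n/2) + O(n²) = Θ(n³), no better than the standard algorithm.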
The next page provides an overview of Strassen's algorithm, setting up the "divide" phase of this divide-and-conquer approach.
You now fully understand standard matrix multiplication—its algorithm, O(n³) complexity, cache optimizations, properties, and real-world significance. This foundation is essential for appreciating Strassen's improvement and the broader question of how fast matrix multiplication can truly be.