Matrix multiplication is one of the most fundamental operations in all of computing. It powers everything from 3D graphics transformations and physics simulations to machine learning algorithms and signal processing. Modern neural networks execute trillions of matrix multiplications during training and inference. Scientific simulations in weather prediction, molecular dynamics, and economic modeling all rely heavily on this single operation.
Yet despite its ubiquity, the standard algorithm for multiplying two n×n matrices requires O(n³) arithmetic operations—a cost that becomes astronomical as matrix sizes grow. For a 1,000×1,000 matrix, that's one billion operations. For a 10,000×10,000 matrix, one trillion. Understanding why this cubic complexity arises, and whether we can do better, is one of the most important questions in computational mathematics.
By the end of this page, you will thoroughly understand the standard matrix multiplication algorithm, derive its O(n³) complexity from first principles, see why every entry requires n multiplications, and understand the computational bottleneck that Strassen's algorithm will later address. You'll also appreciate why matrix multiplication is distinct from element-wise operations.
Before diving into the algorithm, we must establish precise mathematical foundations. A matrix is a rectangular array of numbers organized into rows and columns. An m×n matrix A has m rows and n columns, and we denote the entry in row i and column j as A[i][j] or A_{i,j}.
Notation conventions:
- A[i][j] (or A_{i,j}) denotes the entry in row i, column j; indices run from 1 to the dimension, following the usual mathematical convention.
- When multiplying, A is m×n, B is n×p, and the product C = A × B is m×p.
For square matrices (which we'll focus on for complexity analysis), we have n×n matrices where m = n = p.
| Matrix A | Matrix B | Result C | Valid? |
|---|---|---|---|
| 3×4 matrix | 4×5 matrix | 3×5 matrix | Yes — inner dimensions match (4=4) |
| 3×4 matrix | 5×4 matrix | N/A | No — inner dimensions differ (4≠5) |
| n×n matrix | n×n matrix | n×n matrix | Yes — square matrices always compatible |
| m×n matrix | n×p matrix | m×p matrix | Yes — general compatible case |
As the table above illustrates, matrix multiplication A × B is only defined when the number of columns in A equals the number of rows in B. If A is m×n, then B must be n×p for some p. This isn't a programming constraint—it's a mathematical requirement arising from how matrix multiplication is defined.
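To make the compatibility rule concrete, here is a minimal sketch in Python (the helper name `can_multiply` is ours, not from any library) that checks whether two matrices, represented as lists of lists, can be multiplied:

```python
def can_multiply(A: list[list[float]], B: list[list[float]]) -> bool:
    """Return True if the product A × B is defined.

    The product exists exactly when A's column count (its inner
    dimension) equals B's row count.
    """
    cols_A = len(A[0])  # number of columns of A
    rows_B = len(B)     # number of rows of B
    return cols_A == rows_B

A = [[0.0] * 4 for _ in range(3)]   # 3×4 matrix
B = [[0.0] * 5 for _ in range(4)]   # 4×5 matrix
print(can_multiply(A, B))  # True  — inner dimensions match (4 = 4)
print(can_multiply(B, A))  # False — inner dimensions differ (5 ≠ 3)
```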
Memory layout considerations:
In computer memory, matrices are typically stored in one of two layouts:
- Row-major order: each row is stored contiguously (the convention in C, and the default in NumPy), so A[i][j] sits next to A[i][j+1] in memory.
- Column-major order: each column is stored contiguously (the convention in Fortran and MATLAB), so A[i][j] sits next to A[i+1][j] in memory.
This distinction profoundly affects cache performance. The standard algorithm's memory access patterns interact with cache hierarchy in ways that significantly impact real-world performance beyond the O(n³) theoretical bound.
Matrix multiplication is fundamentally different from element-wise multiplication. Each entry in the result matrix C is computed as a dot product of a row from A and a column from B.
Formal definition:
Given matrices A (m×n) and B (n×p), the product C = A × B is an m×p matrix where:
$$C[i][j] = \sum_{k=1}^{n} A[i][k] \cdot B[k][j]$$
In plain language: C[i][j] equals the sum of products of corresponding elements from the i-th row of A and the j-th column of B.
The dot product of two vectors [a₁, a₂, ..., aₙ] and [b₁, b₂, ..., bₙ] is a₁b₁ + a₂b₂ + ... + aₙbₙ. Matrix multiplication computes n² such dot products for n×n matrices—one for each position in the result matrix.
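As a quick illustration, here is the dot product on its own; matrix multiplication simply repeats this computation once per output entry:

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors: u1*v1 + u2*v2 + ... + un*vn."""
    return sum(u_k * v_k for u_k, v_k in zip(u, v))

# Row 1 of A = [1, 2] with column 1 of B = [5, 7] (from the example below):
print(dot([1, 2], [5, 7]))  # 19
```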
Concrete example with 2×2 matrices:
Let's multiply:
```
A = [1 2]    B = [5 6]
    [3 4]        [7 8]
```
The result C = A × B:
C[1][1] = A[1][1]·B[1][1] + A[1][2]·B[2][1] = 1·5 + 2·7 = 5 + 14 = 19
C[1][2] = A[1][1]·B[1][2] + A[1][2]·B[2][2] = 1·6 + 2·8 = 6 + 16 = 22
C[2][1] = A[2][1]·B[1][1] + A[2][2]·B[2][1] = 3·5 + 4·7 = 15 + 28 = 43
C[2][2] = A[2][1]·B[1][2] + A[2][2]·B[2][2] = 3·6 + 4·8 = 18 + 32 = 50
Therefore:
```
C = [19 22]
    [43 50]
```
Notice that computing each entry required exactly 2 multiplications and 1 addition (n multiplications and n-1 additions for n×n matrices).
The standard algorithm directly implements the mathematical definition. It uses three nested loops: one for each row of A, one for each column of B, and one for the summation over the shared dimension.
Algorithm:
```
function multiplyMatrices(A, B, n):
    // Initialize result matrix C with zeros
    C = new n×n matrix, all entries 0

    // For each row i of A
    for i = 1 to n:
        // For each column j of B
        for j = 1 to n:
            // Compute dot product of row i and column j
            sum = 0
            for k = 1 to n:
                sum = sum + A[i][k] * B[k][j]
            C[i][j] = sum

    return C
```
This algorithm is elegantly simple and directly mirrors the mathematical definition. Its correctness follows immediately from the formula C[i][j] = Σ A[i][k] · B[k][j].
```python
def matrix_multiply(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    """
    Standard O(n³) matrix multiplication.

    Args:
        A: First matrix (n×n)
        B: Second matrix (n×n)

    Returns:
        C: Product matrix A × B (n×n)
    """
    n = len(A)

    # Initialize result matrix with zeros
    C = [[0.0 for _ in range(n)] for _ in range(n)]

    # Triple nested loop - the signature of O(n³) complexity
    for i in range(n):              # Iterate over rows of A
        for j in range(n):          # Iterate over columns of B
            # Compute dot product of row i of A and column j of B
            dot_product = 0.0
            for k in range(n):      # Iterate over shared dimension
                dot_product += A[i][k] * B[k][j]
            C[i][j] = dot_product

    return C

# Example usage
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matrix_multiply(A, B)
print("Result:", C)  # [[19, 22], [43, 50]]
```

- The outer loop (i) selects which row of A we're working with
- The middle loop (j) selects which column of B we're targeting
- The innermost loop (k) accumulates the dot product across the shared dimension
This structure is fundamental—any algorithm that computes all n² entries of C must somehow account for n multiplications per entry.
Let's derive the time complexity with mathematical precision. For n×n matrices, we count the exact number of arithmetic operations.
Multiplication count:
Each of the n² entries C[i][j] is a dot product of length n, requiring n multiplications:
Total multiplications = n² × n = n³
Addition count:
Summing n products requires n − 1 additions per entry:
Total additions = n² × (n − 1) = n³ − n²
| n | Multiplications (n³) | Additions (n³-n²) | Total Operations |
|---|---|---|---|
| 2 | 8 | 4 | 12 |
| 10 | 1,000 | 900 | 1,900 |
| 100 | 1,000,000 | 990,000 | 1,990,000 |
| 1,000 | 10⁹ | ~10⁹ | ~2×10⁹ |
| 10,000 | 10¹² | ~10¹² | ~2×10¹² |
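These counts are easy to verify empirically. The instrumented sketch below (our own variant, not part of the earlier listing) tallies operations and matches the formulas above:

```python
def count_operations(n: int) -> tuple[int, int]:
    """Count multiplications and additions performed by the standard algorithm."""
    mults, adds = 0, 0
    for _ in range(n):          # rows of A
        for _ in range(n):      # columns of B
            for k in range(n):  # shared dimension
                mults += 1      # A[i][k] * B[k][j]
                if k > 0:
                    adds += 1   # accumulating into the running sum
    return mults, adds

for n in (2, 10, 100):
    print(n, count_operations(n))
# Matches n³ and n³ - n²: (2, (8, 4)), (10, (1000, 900)), (100, (1000000, 990000))
```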
Asymptotic analysis:
Total operations T(n) = n³ + (n³ - n²) = 2n³ - n²
For large n, the n³ term dominates:
T(n) = 2n³ - n² = Θ(n³)
Therefore, the standard algorithm's time complexity is O(n³)—and more precisely Θ(n³), since the algorithm performs exactly 2n³ − n² operations on every input, never fewer.
Space complexity:
The algorithm needs Θ(n²) space for the output matrix C, plus O(1) auxiliary space for the loop counters and running sum; the inputs are read but never modified.
Practical consequences of cubic scaling:
Doubling the matrix size (2n) increases time by 8× (since (2n)³ = 8n³). So a 1-second computation on 1,000×1,000 matrices becomes 8 seconds for 2,000×2,000 matrices, 64 seconds for 4,000×4,000, and over 8 minutes for 8,000×8,000. This cubic blow-up motivates the search for better algorithms.
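The arithmetic behind those figures is just the cubic ratio, as a two-line sketch makes explicit:

```python
base_n, base_seconds = 1_000, 1.0  # assumed baseline: 1 second at n = 1,000

for n in (2_000, 4_000, 8_000):
    # Time scales with the cube of the size ratio: (n / base_n)³
    print(n, base_seconds * (n / base_n) ** 3, "seconds")
# 2000 → 8.0, 4000 → 64.0, 8000 → 512.0 (about 8.5 minutes)
```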
At first glance, O(n³) seems like an unavoidable lower bound. The argument goes:
- The result C has n² entries, and every one of them must be computed.
- Each entry is a dot product of length n, which appears to require n multiplications.
- Therefore any algorithm seems to need n² × n = n³ multiplications.
This reasoning is correct for the naive approach but contains a hidden assumption: that each entry must be computed independently. Strassen's insight was that entries are not independent—they share common subexpressions that can be computed once and reused.
The open question of optimal matrix multiplication:
Remarkably, the true lower bound for matrix multiplication remains unknown. We know:
- A trivial lower bound of Ω(n²): any algorithm must at least read all 2n² inputs and write all n² outputs.
- Strassen's algorithm (1969) achieves O(n^2.807).
- A long line of refinements has pushed the best known exponent down to roughly 2.37, though the constant factors make those algorithms impractical for realistic matrix sizes.
The conjecture that matrix multiplication can be done in essentially O(n²) time (exponent ω = 2) remains one of the great open problems in theoretical computer science.
While all loop orderings (ijk, ikj, jik, jki, kij, kji) give the same mathematical result, their performance differs dramatically due to cache behavior.
Understanding the issue:
In our standard (ijk) order:
- A[i][k] is read along row i as k varies: consecutive memory locations (row-wise).
- B[k][j] is read down column j as k varies: strided memory locations (column-wise).
- C[i][j] is written once per (i, j) pair, row by row.
When accessing B column-wise in row-major storage, each access potentially causes a cache miss because consecutive elements of a column are separated by n positions in memory.
| Loop Order | A Access Pattern | B Access Pattern | C Access Pattern | Cache Behavior |
|---|---|---|---|---|
| ijk | Row-wise ✓ | Column-wise ✗ | Row-wise ✓ | Poor |
| ikj | Row-wise ✓ | Row-wise ✓ | Row-wise ✓ | Good |
| jik | Column-wise ✗ | Column-wise ✗ | Column-wise ✗ | Very Poor |
| jki | Column-wise ✗ | Row-wise ✓ | Column-wise ✗ | Poor |
| kij | Row-wise ✓ | Row-wise ✓ | Row-wise ✓ | Good |
| kji | Column-wise ✗ | Row-wise ✓ | Column-wise ✗ | Poor |
```python
import time

def matrix_multiply_ikj(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    """
    Cache-friendly matrix multiplication using ikj loop order.

    Key insight: In ikj order, we access both matrices row-wise, which is
    cache-friendly for row-major storage (like Python lists). Instead of
    computing one complete C[i][j] at a time, we accumulate partial results
    across all C[i][*] simultaneously.
    """
    n = len(A)
    C = [[0.0 for _ in range(n)] for _ in range(n)]

    for i in range(n):          # For each row of A
        for k in range(n):      # For each column of A / row of B
            # A[i][k] is loaded once for the entire inner loop
            a_ik = A[i][k]
            for j in range(n):  # For each column of B
                # B[k][j] accessed row-wise (cache-friendly!)
                # C[i][j] accessed row-wise (cache-friendly!)
                C[i][j] += a_ik * B[k][j]

    return C

# Performance comparison
def benchmark(multiply_func, n=500, name=""):
    A = [[float(i*n + j) for j in range(n)] for i in range(n)]
    B = [[float(i*n + j) for j in range(n)] for i in range(n)]
    start = time.time()
    C = multiply_func(A, B)
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f} seconds")

# Compare the two orderings (matrix_multiply is defined earlier on this page)
benchmark(matrix_multiply, name="ijk (naive)")
benchmark(matrix_multiply_ikj, name="ikj (cache-friendly)")
```

On modern processors, the cache-friendly ikj order can be 3-10× faster than the cache-unfriendly ijk order for large matrices. This is despite having identical O(n³) complexity—a powerful reminder that asymptotic analysis doesn't tell the whole story.
To further improve cache performance, we can use blocking (also called tiling). The idea is to partition matrices into smaller blocks that fit entirely in cache, then perform multiplication at the block level.
The blocking strategy:
- Partition each n×n matrix into (n/b)² blocks of size b×b.
- Treat the matrices as (n/b)×(n/b) block matrices and multiply block by block: C_ij = Σ_k A_ik × B_kj, where each term is itself a b×b matrix product.
- While one block product is being computed, only three b×b blocks need to be resident in cache.
The key insight: if b is chosen so blocks fit in cache (typically L2 cache), we maximize data reuse and minimize cache misses.
```python
def matrix_multiply_blocked(A: list[list[float]],
                            B: list[list[float]],
                            block_size: int = 64) -> list[list[float]]:
    """
    Blocked (tiled) matrix multiplication for improved cache efficiency.

    Args:
        A, B: Input n×n matrices
        block_size: Size of blocks (should fit in L2 cache)

    The algorithm partitions matrices into blocks and performs
    multiplication at the block level:
        C_ij = Σ_k A_ik × B_kj   (where subscripts denote blocks)
    """
    n = len(A)
    C = [[0.0 for _ in range(n)] for _ in range(n)]

    # Iterate over block indices
    for i0 in range(0, n, block_size):          # Block row of C/A
        for j0 in range(0, n, block_size):      # Block column of C/B
            for k0 in range(0, n, block_size):  # Block in shared dimension
                # Multiply block A[i0:i0+b, k0:k0+b] × B[k0:k0+b, j0:j0+b]
                # and accumulate into C[i0:i0+b, j0:j0+b]

                # Determine actual block boundaries (handle edge cases)
                i_end = min(i0 + block_size, n)
                j_end = min(j0 + block_size, n)
                k_end = min(k0 + block_size, n)

                # Standard multiplication within blocks
                for i in range(i0, i_end):
                    for j in range(j0, j_end):
                        for k in range(k0, k_end):
                            C[i][j] += A[i][k] * B[k][j]

    return C

# The block size should be tuned for the specific cache size.
# Typical values: 32-128 for L1, 64-512 for L2
```

Complexity analysis of blocked multiplication:
The asymptotic complexity remains O(n³), but the constant factor improves dramatically due to cache efficiency. For large matrices, blocked multiplication can be 10-50× faster than the naive approach.
Optimal block size selection:
Three blocks (one each from A, B, and C) must fit in cache simultaneously, so we want 3b² × (bytes per element) ≤ cache size, i.e. b ≈ √(cache size / (3 × element size)). For example, with a 256 KB L2 cache and 8-byte doubles, b ≈ √(262144 / 24) ≈ 104, so a power-of-two value like 64 or 128 is a sensible starting point; the best value is found by benchmarking on the target machine.
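That back-of-the-envelope rule is easy to encode (the cache size here is an assumed parameter, not something Python can detect portably):

```python
import math

def suggest_block_size(cache_bytes: int, element_bytes: int = 8) -> int:
    """Largest power of two b such that three b×b blocks fit in cache."""
    b_max = math.isqrt(cache_bytes // (3 * element_bytes))
    b = 1
    while b * 2 <= b_max:
        b *= 2
    return b

print(suggest_block_size(256 * 1024))  # 64 for a 256 KB cache and doubles
```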
Matrix multiplication has distinctive algebraic properties that differ significantly from scalar multiplication. Understanding these properties is crucial for designing algorithms and avoiding bugs:
- Associative: (AB)C = A(BC).
- Distributive over addition: A(B + C) = AB + AC and (A + B)C = AC + BC.
- Not commutative: in general, AB ≠ BA.
- Identity: AI = IA = A, where I is the identity matrix.
The fact that AB ≠ BA in general has profound implications. When computing ABCD, the groupings (AB)(CD), A((BC)D), and ((AB)C)D all give the same result (by associativity), but reordering the factors (computing BA instead of AB) generally gives a different result entirely. Divide-and-conquer algorithms exploit associativity while respecting non-commutativity.
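Non-commutativity is easy to see concretely, reusing the `matrix_multiply` function defined earlier:

```python
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(matrix_multiply(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
print(matrix_multiply(B, A))  # [[23.0, 34.0], [31.0, 46.0]] — a different matrix
```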
Matrix multiplication pervades modern computing. Understanding where O(n³) becomes problematic motivates the search for better algorithms.
| Domain | Typical Matrix Sizes | Operation Frequency | Performance Impact |
|---|---|---|---|
| Machine Learning (Training) | 10,000 × 10,000+ | Billions per epoch | Days/weeks of training time |
| Computer Graphics (3D) | 4 × 4 (transforms) | Millions per frame | 60 FPS requirement |
| Scientific Simulation | 100,000+ × 100,000+ | Thousands per timestep | Weeks of simulation |
| Signal Processing | Varies (FFT as matrix) | Real-time streaming | Latency constraints |
| Cryptography | Large sparse matrices | Per operation | Security/speed tradeoff |
| Recommendation Systems | Users × Items | Continuous updates | Latency for predictions |
Case study: Deep Learning
Modern neural networks are essentially matrix multiplication machines. A single forward pass through a transformer model like GPT involves:
- projecting every token's embedding into query, key, and value vectors (three matrix multiplications per attention layer),
- computing attention scores and their weighted sums (two more matrix products per layer),
- and two large feed-forward matrix multiplications per layer, repeated across dozens of layers.
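To get a feel for the scale, here is a rough FLOP estimate for a single dense layer; the dimensions below are illustrative assumptions, not the actual sizes of any particular model:

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """FLOPs for an m×n by n×p product: m·p entries, each needing
    n multiplications and n-1 additions, i.e. roughly 2·m·n·p total."""
    return 2 * m * n * p

# Hypothetical feed-forward layer: a batch of 2,048 tokens,
# hidden size 12,288, expanded to 4× the hidden size
print(f"{matmul_flops(2048, 12288, 4 * 12288):.3e} FLOPs")  # 2.474e+12
```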
For GPT-3 with 175 billion parameters, training required approximately 10²³ floating-point operations—most of which were matrix multiplications. Even a 10% improvement in matrix multiplication efficiency translates to massive energy and cost savings.
Training large language models costs millions of dollars in compute. If Strassen's algorithm (or better) can reduce matrix multiplication costs by even 10-20% in practice, that translates to hundreds of thousands of dollars saved per training run—plus significant environmental impact from reduced energy consumption.
We've established a thorough understanding of standard matrix multiplication—the computational workhorse that Strassen's algorithm aims to improve.
What's next:
With the standard algorithm firmly understood, we're ready to explore Strassen's revolutionary insight. In 1969, Volker Strassen discovered that by cleverly combining matrix entries, he could reduce the number of multiplications needed for 2×2 matrix multiplication from 8 to 7. This seemingly minor improvement—saving just one multiplication—has cascading effects when applied recursively, yielding an asymptotic complexity of O(n^2.807).
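The exponent comes from the standard divide-and-conquer recurrence: with 7 recursive half-size multiplications instead of 8, plus quadratic work for the matrix additions,

$$T(n) = 7\,T(n/2) + O(n^2) \;\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.807}\right)$$

whereas 8 subproblems would give T(n) = 8T(n/2) + O(n²) = Θ(n³), no better than the standard algorithm.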
The next page provides an overview of Strassen's algorithm, setting up the "divide" phase of this divide-and-conquer approach.
You now fully understand standard matrix multiplication—its algorithm, O(n³) complexity, cache optimizations, properties, and real-world significance. This foundation is essential for appreciating Strassen's improvement and the broader question of how fast matrix multiplication can truly be.