We've seen that Strassen's algorithm reduces 8 multiplications to 7 for 2×2 block matrix multiplication, and that this leads to O(n^2.807) complexity. But how does this reduction actually work? Why can't we reduce further to 6 or 5 multiplications? And how exactly do the savings at each level compound to produce the asymptotic improvement?
This page dissects the mechanics of multiplication reduction in divide-and-conquer matrix algorithms. We'll trace through the algebraic structure, understand why Strassen's formulas are arranged the way they are, and analyze the precise relationship between recursion depth and operation count.
By the end of this page, you will deeply understand how the 7-multiplication scheme works mechanically, why each M product contributes exactly what it needs to, how additions prepare operands for multiplications, and why the recursive structure amplifies a single saved multiplication into asymptotic improvement. You'll also learn about the theoretical limits of this approach.
To understand why we can reduce from 8 to 7 multiplications, we must first understand what each output block needs.
Standard formulas (what we need to compute):
C₁₁ = A₁₁B₁₁ + A₁₂B₂₁ (requires products of: A₁₁B₁₁, A₁₂B₂₁)
C₁₂ = A₁₁B₁₂ + A₁₂B₂₂ (requires products of: A₁₁B₁₂, A₁₂B₂₂)
C₂₁ = A₂₁B₁₁ + A₂₂B₂₁ (requires products of: A₂₁B₁₁, A₂₂B₂₁)
C₂₂ = A₂₁B₁₂ + A₂₂B₂₂ (requires products of: A₂₁B₁₂, A₂₂B₂₂)
Unique products needed: A₁₁B₁₁, A₁₂B₂₁, A₁₁B₁₂, A₁₂B₂₂, A₂₁B₁₁, A₂₂B₂₁, A₂₁B₁₂, A₂₂B₂₂
That's 8 distinct products, and each appears in exactly one output block. There's no direct sharing—each Cᵢⱼ uses a completely different pair of products.
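To make the bookkeeping concrete, here is a minimal NumPy sketch (the 8×8 size and random inputs are purely illustrative) of one level of standard block multiplication: it forms all 8 block products explicitly and checks the result against A @ B.

```python
import numpy as np

n = 8                      # illustrative size; any even n works for one split level
A = np.random.rand(n, n)
B = np.random.rand(n, n)
mid = n // 2

# Split each matrix into 2×2 blocks
A11, A12 = A[:mid, :mid], A[:mid, mid:]
A21, A22 = A[mid:, :mid], A[mid:, mid:]
B11, B12 = B[:mid, :mid], B[:mid, mid:]
B21, B22 = B[mid:, :mid], B[mid:, mid:]

# Standard formulas: 8 block products, each used in exactly one output block
C = np.empty((n, n))
C[:mid, :mid] = A11 @ B11 + A12 @ B21   # C11
C[:mid, mid:] = A11 @ B12 + A12 @ B22   # C12
C[mid:, :mid] = A21 @ B11 + A22 @ B21   # C21
C[mid:, mid:] = A21 @ B12 + A22 @ B22   # C22

print(np.allclose(C, A @ B))            # True
```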
This suggests 8 multiplications are necessary... but this analysis is flawed!
The flaw is assuming we must compute exactly those 8 products. What if we computed DIFFERENT products—products of sums and differences—that can be recombined to yield all four Cᵢⱼ blocks? The products needn't match the definition; they just need to produce the same result.
The key insight:
Consider computing (A₁₁ + A₂₂)(B₁₁ + B₂₂). Expanding:
(A₁₁ + A₂₂)(B₁₁ + B₂₂) = A₁₁B₁₁ + A₁₁B₂₂ + A₂₂B₁₁ + A₂₂B₂₂
This single multiplication produces four terms, two of which appear in the standard formulas: A₁₁B₁₁ (needed for C₁₁) and A₂₂B₂₂ (needed for C₂₂). The other two, A₁₁B₂₂ and A₂₂B₁₁, are extras that will have to be canceled.
By carefully choosing what other products to compute, we can cancel those extra terms and still recover all four Cᵢⱼ blocks from just 7 products.
This is the essence of Strassen's approach: compute products of sums, then add/subtract to isolate what we actually want.
Let's expand each of Strassen's 7 M products and catalog exactly which terms they provide and require.
M₁ = (A₁₁ + A₂₂)(B₁₁ + B₂₂)
= A₁₁B₁₁ + A₁₁B₂₂ + A₂₂B₁₁ + A₂₂B₂₂
Provides: A₁₁B₁₁ (for C₁₁), A₂₂B₂₂ (for C₂₂)
Extras: A₁₁B₂₂, A₂₂B₁₁ (must be canceled)
M₂ = (A₂₁ + A₂₂)B₁₁
= A₂₁B₁₁ + A₂₂B₁₁
Provides: A₂₁B₁₁ (for C₂₁)
Extra: A₂₂B₁₁ (helps cancel M₁'s extra)
M₃ = A₁₁(B₁₂ - B₂₂)
= A₁₁B₁₂ - A₁₁B₂₂
Provides: A₁₁B₁₂ (for C₁₂)
Extra: -A₁₁B₂₂ (helps cancel M₁'s extra)
M₄ = A₂₂(B₂₁ - B₁₁)
= A₂₂B₂₁ - A₂₂B₁₁
Provides: A₂₂B₂₁ (for C₂₁)
Extra: -A₂₂B₁₁ (helps cancel M₁'s extra via C₁₁ formula)
M₅ = (A₁₁ + A₁₂)B₂₂
= A₁₁B₂₂ + A₁₂B₂₂
Provides: A₁₂B₂₂ (for C₁₂)
Extra: A₁₁B₂₂ (helps cancel M₁'s extra)
M₆ = (A₂₁ - A₁₁)(B₁₁ + B₁₂)
= A₂₁B₁₁ + A₂₁B₁₂ - A₁₁B₁₁ - A₁₁B₁₂
Provides: A₂₁B₁₂ (for C₂₂)
Extras: A₂₁B₁₁, -A₁₁B₁₁, -A₁₁B₁₂ (canceled by M₂, M₁, and M₃ in the C₂₂ formula)
M₇ = (A₁₂ - A₂₂)(B₂₁ + B₂₂)
= A₁₂B₂₁ + A₁₂B₂₂ - A₂₂B₂₁ - A₂₂B₂₂
Provides: A₁₂B₂₁ (for C₁₁)
Extras: A₁₂B₂₂, -A₂₂B₂₁, -A₂₂B₂₂ (canceled by M₅, M₄, and M₁ in the C₁₁ formula)
Each M product provides some needed terms (like A₁₁B₁₁) and some extra terms that must be canceled. Strassen's genius was finding a set of 7 products where ALL the extras can be perfectly canceled through the combination formulas, leaving exactly what we need for C₁₁, C₁₂, C₂₁, and C₂₂.
Let's trace through how C₁₁ = M₁ + M₄ - M₅ + M₇ yields exactly A₁₁B₁₁ + A₁₂B₂₁.
Expand each M:
M₁ = +A₁₁B₁₁ +A₁₁B₂₂ +A₂₂B₁₁ +A₂₂B₂₂
M₄ = +A₂₂B₂₁ -A₂₂B₁₁
-M₅ = -A₁₁B₂₂ -A₁₂B₂₂
M₇ = +A₁₂B₂₁ +A₁₂B₂₂ -A₂₂B₂₁ -A₂₂B₂₂
Collect coefficients for each product:
| Term | From M₁ | From M₄ | From -M₅ | From M₇ | Total | Expected |
|---|---|---|---|---|---|---|
| A₁₁B₁₁ | +1 | 0 | 0 | 0 | +1 | +1 ✓ |
| A₁₁B₂₂ | +1 | 0 | -1 | 0 | 0 | 0 ✓ |
| A₂₂B₁₁ | +1 | -1 | 0 | 0 | 0 | 0 ✓ |
| A₂₂B₂₂ | +1 | 0 | 0 | -1 | 0 | 0 ✓ |
| A₂₂B₂₁ | 0 | +1 | 0 | -1 | 0 | 0 ✓ |
| A₁₂B₂₂ | 0 | 0 | -1 | +1 | 0 | 0 ✓ |
| A₁₂B₂₁ | 0 | 0 | 0 | +1 | +1 | +1 ✓ |
Result: Every unwanted term has a total coefficient of 0, and exactly the terms A₁₁B₁₁ and A₁₂B₂₁ remain with coefficient +1.
This is no accident—Strassen designed the M products and the combination formula together so that the cancellations work out perfectly. The same analysis (with different signs) works for C₁₂, C₂₁, and C₂₂.
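As a sanity check, here is a short symbolic sketch using sympy. The blocks are modeled as non-commutative symbols (block products don't commute), and expanding each combination formula minus its target shows that every extra term cancels for all four blocks, not just C₁₁.

```python
import sympy as sp

# Blocks as non-commutative symbols, since matrix blocks don't commute
A11, A12, A21, A22 = sp.symbols('A11 A12 A21 A22', commutative=False)
B11, B12, B21, B22 = sp.symbols('B11 B12 B21 B22', commutative=False)

# Strassen's 7 products
M1 = (A11 + A22) * (B11 + B22)
M2 = (A21 + A22) * B11
M3 = A11 * (B12 - B22)
M4 = A22 * (B21 - B11)
M5 = (A11 + A12) * B22
M6 = (A21 - A11) * (B11 + B12)
M7 = (A12 - A22) * (B21 + B22)

# Combination formulas minus the standard definitions: all must expand to 0
checks = {
    'C11': (M1 + M4 - M5 + M7) - (A11*B11 + A12*B21),
    'C12': (M3 + M5)           - (A11*B12 + A12*B22),
    'C21': (M2 + M4)           - (A21*B11 + A22*B21),
    'C22': (M1 - M2 + M3 + M6) - (A21*B12 + A22*B22),
}

for name, diff in checks.items():
    print(name, 'OK' if sp.expand(diff) == 0 else 'MISMATCH')
```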
We have 8 unique products in the standard formulas, and 7 M products each producing 2-4 terms. To recover all 8 unique products with correct coefficients while canceling all extras requires solving a system with many constraints. That a solution exists with 7 (not 8) products is the remarkable fact.
A natural question: if Strassen reduced 8 to 7, can someone reduce 7 to 6? The answer, for 2×2 matrix multiplication using this specific divide-and-conquer structure, is no.
Theoretical lower bound:
Winograd and Hopcroft-Kerr proved in 1971 that 7 multiplications are necessary for 2×2 matrix multiplication: the rank of the 2×2 matrix multiplication tensor is exactly 7. Landsberg later strengthened this by showing that even the border rank is 7, so approximate bilinear schemes cannot do better either.
This doesn't mean 7 is optimal for all n—larger block sizes might enable different savings.
What about different block sizes?
Strassen works on 2×2 blocks. What about 3×3?
Using 23 multiplications for 3×3 gives:
T(n) = 23T(n/3) + Θ(n²)
→ O(n^log₃(23)) = O(n^2.854)
This is worse than Strassen's O(n^2.807)! Larger block sizes don't necessarily help.
| Block Size | Multiplications | Complexity | Exponent |
|---|---|---|---|
| 2×2 (Standard) | 8 | O(n^log₂8) | 3.000 |
| 2×2 (Strassen) | 7 | O(n^log₂7) | 2.807 |
| 3×3 (Standard) | 27 | O(n^log₃27) | 3.000 |
| 3×3 (Laderman) | 23 | O(n^log₃23) | 2.854 |
| 70×70 (Pan) | 143,640 | O(n^log₇₀143640) | 2.795 |
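As a quick check on the exponents in the table, this sketch computes log_b(m) for a scheme that multiplies b×b blocks with m block products (the values mirror the table above):

```python
import math

# A b×b block scheme with m block products gives T(n) = m·T(n/b) + Θ(n²),
# i.e. O(n^log_b(m)) whenever log_b(m) > 2 (so the products dominate).
schemes = [
    ("2×2 standard", 2, 8),
    ("2×2 Strassen", 2, 7),
    ("3×3 standard", 3, 27),
    ("3×3 Laderman", 3, 23),
    ("70×70 Pan",   70, 143640),
]

for name, b, m in schemes:
    print(f"{name:>14}: log_{b}({m}) = {math.log(m, b):.3f}")
```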
Algorithms like Coppersmith-Winograd push the exponent down further (to about 2.376, with later refinements slightly lower) using far more elaborate tensor techniques rather than a simple fixed block scheme. However, the constant factors are so enormous that they would only win for matrices far larger than can fit in any computer. Strassen remains the only subcubic algorithm used in practice.
The magic of Strassen's algorithm is that saving 1 multiplication at each level compounds exponentially through the recursion.
Multiplication count at each level: for an n×n matrix (n = 2^k), level 0 spawns 7 subproblems, level 1 spawns 7² = 49, and in general level ℓ contains 7^ℓ subproblems of size (n/2^ℓ)×(n/2^ℓ).
Total scalar multiplications: the recursion bottoms out at level k with 7^k multiplications of 1×1 blocks, one per leaf of the recursion tree.
With n = 2^k → k = log₂(n):
Total = 7^k = 7^(log₂n) = n^(log₂7) = n^2.807
Comparison with standard algorithm:
Standard uses 8 multiplications per level, so its leaf count is 8^k = 8^(log₂n) = n^(log₂8) = n³ scalar multiplications.
The difference:
n³ / n^2.807 = n^0.193
For n = 1024: 1024^0.193 ≈ 3.8 times fewer scalar multiplications
For n = 4096: 4096^0.193 ≈ 5.0 times fewer scalar multiplications
For n = 16384: 16384^0.193 ≈ 6.5 times fewer scalar multiplications
The ratio keeps growing! This is the power of improving the exponent.
Saving 1 multiplication out of 8 (a 12.5% reduction at each level) produces exponentially growing savings: 3.8× at n=1024, 5.0× at n=4096, 6.5× at n=16384. Each doubling of n increases the advantage by a factor of 8/7 ≈ 1.14.
```python
import math

def analyze_savings():
    """
    Compare multiplication counts for standard vs Strassen algorithm.
    """
    print(f"{'n':>8} | {'Standard (n³)':>15} | {'Strassen (n^2.807)':>18} | {'Ratio':>8}")
    print("-" * 60)
    for k in range(4, 15):  # n = 16 to 16384
        n = 2 ** k
        # Standard: n³ scalar multiplications
        standard = n ** 3
        # Strassen: 7^k scalar multiplications (k = log₂n)
        strassen = 7 ** k
        ratio = standard / strassen
        print(f"{n:>8} | {standard:>15,} | {strassen:>18,} | {ratio:>8.2f}x")

analyze_savings()

# Output demonstrates the growing advantage:
# n = 16:     ≈1.7× fewer multiplications
# n = 1024:   ≈3.8× fewer
# n = 16384:  ≈6.5× fewer
```

Strassen requires 18 matrix additions/subtractions per level (vs 4 for standard). Why doesn't this overhead eliminate the savings?
Addition cost per level: at level ℓ there are 7^ℓ subproblems of size (n/2^ℓ)×(n/2^ℓ), and each performs 18 block additions at that scale, roughly 18 × (n/2^ℓ)² scalar additions per subproblem.
Total addition cost across all levels:
Total_additions = Σ(ℓ=0 to log₂n-1) [18 × (n/2^ℓ)² × 7^ℓ]
= 18n² × Σ(ℓ=0 to log₂n-1) [7^ℓ / 4^ℓ]
= 18n² × Σ(ℓ=0 to log₂n-1) [(7/4)^ℓ]
The sum is a geometric series with ratio 7/4 > 1, so the last term dominates:
Total_additions ≈ 18n² × (7/4)^(log₂n - 1) = O(n² × n^(log₂(7/4))) = O(n^2.807)
This is the same asymptotic order as the multiplications!
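A quick numerical check of this claim: the sketch below counts scalar additions exactly from the recurrence (each call on an m×m matrix does 18 block additions on (m/2)×(m/2) operands, then recurses 7 times) and shows the count divided by n^log₂7 settling toward a constant. It counts additions at their true half-size operands, so the constant is smaller than the coarse 18n² bound above, but the growth rate is the same.

```python
import math

def strassen_addition_ops(n):
    """Scalar additions performed by pure Strassen recursion down to 1×1 blocks."""
    if n == 1:
        return 0
    half = n // 2
    # 18 block additions/subtractions on half×half operands, then 7 identical subcalls
    return 18 * half * half + 7 * strassen_addition_ops(half)

for k in range(4, 13):           # n = 16 .. 4096
    n = 2 ** k
    adds = strassen_addition_ops(n)
    # Ratio approaches 6, matching the closed form 6·(7^k − 4^k)
    print(f"n={n:>5}: additions = {adds:>15,}  additions / n^log2(7) = {adds / n**math.log2(7):.3f}")
```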
However, the additions are cheap relative to the recursive multiplication work, so they contribute only to the constant factor of the total:
Total work:
T(n) = (multiplications) + (additions)
= O(n^2.807) + O(n^2.807)
= O(n^2.807)
The addition overhead doesn't change the complexity class—it just increases the constant factor.
While asymptotically irrelevant, Strassen's larger constant factor (due to 18 additions vs 4) means it only beats standard multiplication for matrices beyond a certain size, typically n ≈ 500-1000 depending on hardware and implementation.
Visualizing Strassen's algorithm as a recursion tree clarifies how the savings accumulate.
Standard algorithm tree:
[n×n]
│
┌───┬───┬───┬─┴─┬───┬───┬───┐
[n/2] ×8 products
│
┌───┬───┼───┬───┬───┬───┬───┐
[n/4] ×64 products
...
[1×1] ×n³ products
Each level has 8× more nodes than the previous. Total leaves: 8^(log₂n) = n³
Strassen's algorithm tree:
[n×n]
│
┌───┬───┬───┬─┴─┬───┬───┐
[n/2] ×7 products
│
┌───┬───┼───┬───┬───┬───┐
[n/4] ×49 products
...
[1×1] ×n^2.807 products
Each level has 7× more nodes than the previous. Total leaves: 7^(log₂n) = n^2.807
Work distribution in the tree:
For standard multiplication: level ℓ contains 8^ℓ nodes, each doing Θ((n/2^ℓ)²) combining work, so the per-level total is n² × 2^ℓ and doubles at every level.
For Strassen: level ℓ contains 7^ℓ nodes, so the per-level total is n² × (7/4)^ℓ and grows by only a factor of 7/4 per level.
The key difference:
Standard: the bottom level dominates, contributing n² × 2^(log₂n) = n³ total work. Strassen: the bottom level contributes n² × (7/4)^(log₂n) = n² × n^(log₂(7/4)) ≈ n² × n^0.807 = n^2.807.
The recursion tree view makes the savings intuitive: fewer branches at each level means fewer total leaves, and leaves represent the elementary operations. Strassen's tree has n^2.807 leaves vs n³ for standard—a ratio that grows without bound.
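The sketch below prints the node counts level by level for an illustrative n = 1024 (k = 10), making the 8-way versus 7-way branching, and the resulting leaf counts, visible:

```python
# Subproblem counts per level of the two recursion trees, for n = 1024 (k = 10)
k = 10
for level in range(k + 1):
    size = 2 ** (k - level)
    print(f"level {level:>2}: {size:>4}×{size:<4} blocks | "
          f"standard nodes: {8**level:>13,} | Strassen nodes: {7**level:>13,}")

print(f"leaves: standard = 8^{k} = {8**k:,} = n³;  Strassen = 7^{k} = {7**k:,} ≈ n^2.807")
```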
The divide-and-conquer structure of Strassen's algorithm has significant implications for memory usage and cache performance.
Memory overhead:
Temporary matrices: Each level needs intermediate matrices for sums (e.g., A₁₁ + A₂₂) and products (M₁ through M₇)
Recursion stack: O(log n) levels of recursion, each with O(1) pointers
Total extra space: The intermediate matrices require O(n²) additional memory
Standard algorithm memory: O(n²) for output (no additional needed)
Strassen memory: O(n²) for output + O(n²) for intermediates = O(n²)
Asymptotically the same, but Strassen needs roughly 2-3× more memory in practice due to intermediate allocations.
Cache behavior:
Strassen's recursive structure has interesting cache properties:
Advantages: the recursion subdivides until subproblems fit entirely in cache, at which point they run with good locality and no explicit blocking is required.
Disadvantages: the temporary matrices for operand sums and the seven M products add memory traffic and allocations, which can evict useful data and hurt locality at the larger levels.
Cache-oblivious property:
Strassen is technically "cache-oblivious"—it adapts to any cache size without explicit tuning. When subproblems become small enough to fit in cache, they execute efficiently. This contrasts with explicit tiled algorithms that must be tuned for specific cache sizes.
In practice, naive Strassen implementations often lose to well-tuned iterative algorithms due to memory overhead and cache effects. High-performance implementations use Strassen at the top levels (for asymptotic benefit) but switch to optimized BLAS kernels at the base case (for cache efficiency).
A critical implementation decision is when to stop recursing and switch to standard multiplication. The choice significantly affects performance.
The trade-off: each level of recursion saves one multiplication out of eight, but pays for 18 block additions, temporary allocations, and call overhead; below some subproblem size the overhead outweighs the saved multiplication, so it's faster to stop recursing and multiply directly (as explored in the sketch below).
Factors affecting crossover: cache sizes, the quality of the base-case multiply (a tuned BLAS raises the crossover), SIMD and memory-bandwidth characteristics, allocation cost, and whether n is a power of two (otherwise padding adds work).
```python
import numpy as np
import time

def find_crossover():
    """
    Empirically find the crossover point between Strassen and standard.
    """
    def standard_multiply(A, B):
        return A @ B  # NumPy's BLAS-optimized multiplication

    def strassen(A, B, threshold):
        n = A.shape[0]
        if n <= threshold:
            return A @ B  # below the threshold, fall back to standard multiplication

        mid = n // 2
        A11, A12 = A[:mid, :mid], A[:mid, mid:]
        A21, A22 = A[mid:, :mid], A[mid:, mid:]
        B11, B12 = B[:mid, :mid], B[:mid, mid:]
        B21, B22 = B[mid:, :mid], B[mid:, mid:]

        # The 7 recursive products
        M1 = strassen(A11 + A22, B11 + B22, threshold)
        M2 = strassen(A21 + A22, B11, threshold)
        M3 = strassen(A11, B12 - B22, threshold)
        M4 = strassen(A22, B21 - B11, threshold)
        M5 = strassen(A11 + A12, B22, threshold)
        M6 = strassen(A21 - A11, B11 + B12, threshold)
        M7 = strassen(A12 - A22, B21 + B22, threshold)

        # Combine into the four output blocks
        C = np.empty((n, n))
        C[:mid, :mid] = M1 + M4 - M5 + M7
        C[:mid, mid:] = M3 + M5
        C[mid:, :mid] = M2 + M4
        C[mid:, mid:] = M1 - M2 + M3 + M6
        return C

    thresholds = [16, 32, 64, 128, 256, 512]
    n = 2048
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    print(f"Testing n={n}")
    print(f"{'Threshold':>12} | {'Time (seconds)':>15}")
    print("-" * 32)
    for t in thresholds:
        start = time.time()
        C = strassen(A, B, t)
        elapsed = time.time() - start
        print(f"{t:>12} | {elapsed:>15.4f}")

find_crossover()

# Typical output might show optimal threshold around 64-128
```

Typical crossover points:
| Environment | Typical Crossover |
|---|---|
| Naive Python | n ≈ 512-1024 |
| NumPy/BLAS | n ≈ 256-512 |
| Optimized C | n ≈ 64-256 |
| GPU (theoretical) | n ≈ 1024+ |
Note: When competing against highly optimized BLAS implementations, Strassen often needs larger n to show benefit because BLAS exploits cache so effectively.
Production implementations (like Intel MKL, OpenBLAS) use a hybrid approach: Strassen for the first few recursive levels (to get asymptotic benefit), then switch to highly optimized BLAS kernels (to get cache/SIMD benefit). The threshold is often dynamically chosen based on detected hardware.
We've dissected exactly how Strassen's divide-and-conquer approach achieves subcubic matrix multiplication through reducing multiplications.
What's next:
Having fully understood how Strassen achieves O(n^2.807), we'll examine practical considerations: when to use Strassen vs standard algorithms, implementation challenges, numerical stability, and the state of matrix multiplication in modern computing systems.
You now deeply understand the mechanics of how Strassen reduces multiplications through divide-and-conquer: the algebraic structure, the cancellation patterns, the recursive amplification, and why 7 is the optimal number for 2×2 blocks. This mechanical understanding is essential for appreciating both the elegance and limitations of the algorithm.