We've seen that Strassen's algorithm reduces 8 multiplications to 7 for 2×2 block matrix multiplication, and that this leads to O(n^2.807) complexity. But how does this reduction actually work? Why can't we reduce further to 6 or 5 multiplications? And how exactly do the savings at each level compound to produce the asymptotic improvement?
This page dissects the mechanics of multiplication reduction in divide-and-conquer matrix algorithms. We'll trace through the algebraic structure, understand why Strassen's formulas are arranged the way they are, and analyze the precise relationship between recursion depth and operation count.
By the end of this page, you will deeply understand how the 7-multiplication scheme works mechanically, why each M product contributes exactly what it needs to, how additions prepare operands for multiplications, and why the recursive structure amplifies a single saved multiplication into asymptotic improvement. You'll also learn about the theoretical limits of this approach.
To understand why we can reduce from 8 to 7 multiplications, we must first understand what each output block needs.
Standard formulas (what we need to compute):
C₁₁ = A₁₁B₁₁ + A₁₂B₂₁ (requires products of: A₁₁B₁₁, A₁₂B₂₁)
C₁₂ = A₁₁B₁₂ + A₁₂B₂₂ (requires products of: A₁₁B₁₂, A₁₂B₂₂)
C₂₁ = A₂₁B₁₁ + A₂₂B₂₁ (requires products of: A₂₁B₁₁, A₂₂B₂₁)
C₂₂ = A₂₁B₁₂ + A₂₂B₂₂ (requires products of: A₂₁B₁₂, A₂₂B₂₂)
Unique products needed: A₁₁B₁₁, A₁₂B₂₁, A₁₁B₁₂, A₁₂B₂₂, A₂₁B₁₁, A₂₂B₂₁, A₂₁B₁₂, A₂₂B₂₂
That's 8 distinct products, and each appears in exactly one output block. There's no direct sharing—each Cᵢⱼ uses a completely different pair of products.
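To make the bookkeeping concrete, here is a minimal NumPy sketch (the 8×8 size and random inputs are purely illustrative) of one level of standard block multiplication: it forms all 8 block products explicitly and checks the result against A @ B.

```python
import numpy as np

n = 8                      # illustrative size; any even n works for one split level
A = np.random.rand(n, n)
B = np.random.rand(n, n)
mid = n // 2

# Split each matrix into 2×2 blocks
A11, A12 = A[:mid, :mid], A[:mid, mid:]
A21, A22 = A[mid:, :mid], A[mid:, mid:]
B11, B12 = B[:mid, :mid], B[:mid, mid:]
B21, B22 = B[mid:, :mid], B[mid:, mid:]

# Standard formulas: 8 block products, each used in exactly one output block
C = np.empty((n, n))
C[:mid, :mid] = A11 @ B11 + A12 @ B21   # C11
C[:mid, mid:] = A11 @ B12 + A12 @ B22   # C12
C[mid:, :mid] = A21 @ B11 + A22 @ B21   # C21
C[mid:, mid:] = A21 @ B12 + A22 @ B22   # C22

print(np.allclose(C, A @ B))            # True
```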
This suggests 8 multiplications are necessary... but this analysis is flawed!
The flaw is assuming we must compute exactly those 8 products. What if we computed DIFFERENT products—products of sums and differences—that can be recombined to yield all four Cᵢⱼ blocks? The products needn't match the definition; they just need to produce the same result.
The key insight:
Consider computing (A₁₁ + A₂₂)(B₁₁ + B₂₂). Expanding:
(A₁₁ + A₂₂)(B₁₁ + B₂₂) = A₁₁B₁₁ + A₁₁B₂₂ + A₂₂B₁₁ + A₂₂B₂₂
This single multiplication produces four terms, two of which appear in the standard formulas: A₁₁B₁₁ (needed for C₁₁) and A₂₂B₂₂ (needed for C₂₂). The other two, A₁₁B₂₂ and A₂₂B₁₁, are extras that will have to be canceled.
By carefully choosing what other products to compute, we can cancel those extra terms and still recover all four Cᵢⱼ blocks from just 7 products.
This is the essence of Strassen's approach: compute products of sums, then add/subtract to isolate what we actually want.
Let's expand each of Strassen's 7 M products and catalog exactly which terms they provide and require.
M₁ = (A₁₁ + A₂₂)(B₁₁ + B₂₂)
= A₁₁B₁₁ + A₁₁B₂₂ + A₂₂B₁₁ + A₂₂B₂₂
Provides: A₁₁B₁₁ (for C₁₁), A₂₂B₂₂ (for C₂₂)
Extras: A₁₁B₂₂, A₂₂B₁₁ (must be canceled)
M₂ = (A₂₁ + A₂₂)B₁₁
= A₂₁B₁₁ + A₂₂B₁₁
Provides: A₂₁B₁₁ (for C₂₁)
Extra: A₂₂B₁₁ (helps cancel M₁'s extra)
M₃ = A₁₁(B₁₂ - B₂₂)
= A₁₁B₁₂ - A₁₁B₂₂
Provides: A₁₁B₁₂ (for C₁₂)
Extra: -A₁₁B₂₂ (helps cancel M₁'s extra)
M₄ = A₂₂(B₂₁ - B₁₁)
= A₂₂B₂₁ - A₂₂B₁₁
Provides: A₂₂B₂₁ (for C₂₁)
Extra: -A₂₂B₁₁ (helps cancel M₁'s extra via C₁₁ formula)
M₅ = (A₁₁ + A₁₂)B₂₂
= A₁₁B₂₂ + A₁₂B₂₂
Provides: A₁₂B₂₂ (for C₁₂)
Extra: A₁₁B₂₂ (helps cancel M₁'s extra)
M₆ = (A₂₁ - A₁₁)(B₁₁ + B₁₂)
= A₂₁B₁₁ + A₂₁B₁₂ - A₁₁B₁₁ - A₁₁B₁₂
Provides: A₂₁B₁₂ (for C₂₂)
Extras: A₂₁B₁₁, -A₁₁B₁₁, -A₁₁B₁₂ (canceled by M₂, M₁, and M₃ in the C₂₂ formula)
M₇ = (A₁₂ - A₂₂)(B₂₁ + B₂₂)
= A₁₂B₂₁ + A₁₂B₂₂ - A₂₂B₂₁ - A₂₂B₂₂
Provides: A₁₂B₂₁ (for C₁₁)
Extras: A₁₂B₂₂, -A₂₂B₂₁, -A₂₂B₂₂ (canceled by M₅, M₄, and M₁ in the C₁₁ formula)
Each M product provides some needed terms (like A₁₁B₁₁) and some extra terms that must be canceled. Strassen's genius was finding a set of 7 products where ALL the extras can be perfectly canceled through the combination formulas, leaving exactly what we need for C₁₁, C₁₂, C₂₁, and C₂₂.
Let's trace through how C₁₁ = M₁ + M₄ - M₅ + M₇ yields exactly A₁₁B₁₁ + A₁₂B₂₁.
Expand each M:
M₁ = +A₁₁B₁₁ +A₁₁B₂₂ +A₂₂B₁₁ +A₂₂B₂₂
M₄ = +A₂₂B₂₁ -A₂₂B₁₁
-M₅ = -A₁₁B₂₂ -A₁₂B₂₂
M₇ = +A₁₂B₂₁ +A₁₂B₂₂ -A₂₂B₂₁ -A₂₂B₂₂
Collect coefficients for each product:
| Term | From M₁ | From M₄ | From -M₅ | From M₇ | Total | Expected |
|---|---|---|---|---|---|---|
| A₁₁B₁₁ | +1 | 0 | 0 | 0 | +1 | +1 ✓ |
| A₁₁B₂₂ | +1 | 0 | -1 | 0 | 0 | 0 ✓ |
| A₂₂B₁₁ | +1 | -1 | 0 | 0 | 0 | 0 ✓ |
| A₂₂B₂₂ | +1 | 0 | 0 | -1 | 0 | 0 ✓ |
| A₂₂B₂₁ | 0 | +1 | 0 | -1 | 0 | 0 ✓ |
| A₁₂B₂₂ | 0 | 0 | -1 | +1 | 0 | 0 ✓ |
| A₁₂B₂₁ | 0 | 0 | 0 | +1 | +1 | +1 ✓ |
Result: Every unwanted term has a total coefficient of 0, and exactly the terms A₁₁B₁₁ and A₁₂B₂₁ remain with coefficient +1.
This is no accident—Strassen designed the M products and the combination formula together so that the cancellations work out perfectly. The same analysis (with different signs) works for C₁₂, C₂₁, and C₂₂.
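As a sanity check, here is a short symbolic sketch using sympy. The blocks are modeled as non-commutative symbols (block products don't commute), and expanding each combination formula minus its target shows that every extra term cancels for all four blocks, not just C₁₁.

```python
import sympy as sp

# Blocks as non-commutative symbols, since matrix blocks don't commute
A11, A12, A21, A22 = sp.symbols('A11 A12 A21 A22', commutative=False)
B11, B12, B21, B22 = sp.symbols('B11 B12 B21 B22', commutative=False)

# Strassen's 7 products
M1 = (A11 + A22) * (B11 + B22)
M2 = (A21 + A22) * B11
M3 = A11 * (B12 - B22)
M4 = A22 * (B21 - B11)
M5 = (A11 + A12) * B22
M6 = (A21 - A11) * (B11 + B12)
M7 = (A12 - A22) * (B21 + B22)

# Combination formulas minus the standard definitions: all must expand to 0
checks = {
    'C11': (M1 + M4 - M5 + M7) - (A11*B11 + A12*B21),
    'C12': (M3 + M5)           - (A11*B12 + A12*B22),
    'C21': (M2 + M4)           - (A21*B11 + A22*B21),
    'C22': (M1 - M2 + M3 + M6) - (A21*B12 + A22*B22),
}

for name, diff in checks.items():
    print(name, 'OK' if sp.expand(diff) == 0 else 'MISMATCH')
```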
We have 8 unique products in the standard formulas, and 7 M products each producing 2-4 terms. To recover all 8 unique products with correct coefficients while canceling all extras requires solving a system with many constraints. That a solution exists with 7 (not 8) products is the remarkable fact.
A natural question: if Strassen reduced 8 to 7, can someone reduce 7 to 6? The answer, for 2×2 matrix multiplication using this specific divide-and-conquer structure, is no.
Theoretical lower bound:
Winograd and Hopcroft-Kerr proved in 1971 that 7 multiplications are necessary for 2×2 matrix multiplication: the rank of the 2×2 matrix multiplication tensor is exactly 7. Landsberg later strengthened this by showing that even the border rank is 7, so approximate bilinear schemes cannot do better either.
This doesn't mean 7 is optimal for all n—larger block sizes might enable different savings.
What about different block sizes?
Strassen works on 2×2 blocks. What about 3×3?
Using 23 multiplications for 3×3 gives:
T(n) = 23T(n/3) + Θ(n²)
→ O(n^log₃(23)) = O(n^2.854)
This is worse than Strassen's O(n^2.807)! Larger block sizes don't necessarily help.
| Block Size | Multiplications | Complexity | Exponent |
|---|---|---|---|
| 2×2 (Standard) | 8 | O(n^log₂8) | 3.000 |
| 2×2 (Strassen) | 7 | O(n^log₂7) | 2.807 |
| 3×3 (Standard) | 27 | O(n^log₃27) | 3.000 |
| 3×3 (Laderman) | 23 | O(n^log₃23) | 2.854 |
| 70×70 (Pan) | 143,640 | O(n^log₇₀143640) | 2.795 |
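As a quick check on the exponents in the table, this sketch computes log_b(m) for a scheme that multiplies b×b blocks with m block products (the values mirror the table above):

```python
import math

# A b×b block scheme with m block products gives T(n) = m·T(n/b) + Θ(n²),
# i.e. O(n^log_b(m)) whenever log_b(m) > 2 (so the products dominate).
schemes = [
    ("2×2 standard", 2, 8),
    ("2×2 Strassen", 2, 7),
    ("3×3 standard", 3, 27),
    ("3×3 Laderman", 3, 23),
    ("70×70 Pan",   70, 143640),
]

for name, b, m in schemes:
    print(f"{name:>14}: log_{b}({m}) = {math.log(m, b):.3f}")
```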
Algorithms like Coppersmith-Winograd push the exponent down further (to about 2.376, with later refinements slightly lower) using far more elaborate tensor techniques rather than a simple fixed block scheme. However, the constant factors are so enormous that they would only win for matrices far larger than can fit in any computer. Strassen remains the only subcubic algorithm used in practice.
The magic of Strassen's algorithm is that saving 1 multiplication at each level compounds exponentially through the recursion.
Multiplication count at each level: for an n×n matrix (n = 2^k), level 0 spawns 7 subproblems, level 1 spawns 7² = 49, and in general level ℓ contains 7^ℓ subproblems of size (n/2^ℓ)×(n/2^ℓ).
Total scalar multiplications: the recursion bottoms out at level k with 7^k multiplications of 1×1 blocks, one per leaf of the recursion tree.
With n = 2^k → k = log₂(n):
Total = 7^k = 7^(log₂n) = n^(log₂7) = n^2.807
Comparison with standard algorithm:
Standard uses 8 multiplications per level, so its leaf count is 8^k = 8^(log₂n) = n^(log₂8) = n³ scalar multiplications.
The difference:
n³ / n^2.807 = n^0.193
For n = 1024: 1024^0.193 ≈ 3.8 times fewer scalar multiplications
For n = 4096: 4096^0.193 ≈ 5.0 times fewer scalar multiplications
For n = 16384: 16384^0.193 ≈ 6.5 times fewer scalar multiplications
The ratio keeps growing! This is the power of improving the exponent.
Saving 1 multiplication out of 8 (a 12.5% reduction at each level) produces exponentially growing savings: 3.8× at n=1024, 5.0× at n=4096, 6.5× at n=16384. Each doubling of n increases the advantage by a factor of 8/7 ≈ 1.14.
```python
import math

def analyze_savings():
    """
    Compare multiplication counts for standard vs Strassen algorithm.
    """
    print(f"{'n':>8} | {'Standard (n³)':>15} | {'Strassen (n^2.807)':>18} | {'Ratio':>8}")
    print("-" * 60)
    for k in range(4, 15):  # n = 16 to 16384
        n = 2 ** k
        # Standard: n³ scalar multiplications
        standard = n ** 3
        # Strassen: 7^k scalar multiplications (k = log₂n)
        strassen = 7 ** k
        ratio = standard / strassen
        print(f"{n:>8} | {standard:>15,} | {strassen:>18,} | {ratio:>8.2f}x")

analyze_savings()

# Output demonstrates the growing advantage:
# n = 16:     ≈1.7× fewer multiplications
# n = 1024:   ≈3.8× fewer
# n = 16384:  ≈6.5× fewer
```

Strassen requires 18 matrix additions/subtractions per level (vs 4 for standard). Why doesn't this overhead eliminate the savings?
Addition cost per level: at level ℓ there are 7^ℓ subproblems of size (n/2^ℓ)×(n/2^ℓ), and each performs 18 block additions at that scale, roughly 18 × (n/2^ℓ)² scalar additions per subproblem.
Total addition cost across all levels:
Total_additions = Σ(ℓ=0 to log₂n-1) [18 × (n/2^ℓ)² × 7^ℓ]
= 18n² × Σ(ℓ=0 to log₂n-1) [7^ℓ / 4^ℓ]
= 18n² × Σ(ℓ=0 to log₂n-1) [(7/4)^ℓ]
The sum is a geometric series with ratio 7/4 > 1, so the last term dominates:
Total_additions ≈ 18n² × (7/4)^(log₂n - 1) = O(n² × n^(log₂(7/4))) = O(n^2.807)
This is the same asymptotic order as the multiplications!
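A quick numerical check of this claim: the sketch below counts scalar additions exactly from the recurrence (each call on an m×m matrix does 18 block additions on (m/2)×(m/2) operands, then recurses 7 times) and shows the count divided by n^log₂7 settling toward a constant. It counts additions at their true half-size operands, so the constant is smaller than the coarse 18n² bound above, but the growth rate is the same.

```python
import math

def strassen_addition_ops(n):
    """Scalar additions performed by pure Strassen recursion down to 1×1 blocks."""
    if n == 1:
        return 0
    half = n // 2
    # 18 block additions/subtractions on half×half operands, then 7 identical subcalls
    return 18 * half * half + 7 * strassen_addition_ops(half)

for k in range(4, 13):           # n = 16 .. 4096
    n = 2 ** k
    adds = strassen_addition_ops(n)
    # Ratio approaches 6, matching the closed form 6·(7^k − 4^k)
    print(f"n={n:>5}: additions = {adds:>15,}  additions / n^log2(7) = {adds / n**math.log2(7):.3f}")
```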
However, the additions are cheap relative to the recursive multiplication work, so they contribute only to the constant factor of the total:
Total work:
T(n) = (multiplications) + (additions)
= O(n^2.807) + O(n^2.807)
= O(n^2.807)
The addition overhead doesn't change the complexity class—it just increases the constant factor.
While asymptotically irrelevant, Strassen's larger constant factor (due to 18 additions vs 4) means it only beats standard multiplication for matrices beyond a certain size, typically n ≈ 500-1000 depending on hardware and implementation.
Visualizing Strassen's algorithm as a recursion tree clarifies how the savings accumulate.
Standard algorithm tree:
[n×n]
│
┌───┬───┬───┬─┴─┬───┬───┬───┐
[n/2] ×8 products
│
┌───┬───┼───┬───┬───┬───┬───┐
[n/4] ×64 products
...
[1×1] ×n³ products
Each level has 8× more nodes than the previous. Total leaves: 8^(log₂n) = n³
Strassen's algorithm tree:
[n×n]
│
┌───┬───┬───┬─┴─┬───┬───┐
[n/2] ×7 products
│
┌───┬───┼───┬───┬───┬───┐
[n/4] ×49 products
...
[1×1] ×n^2.807 products
Each level has 7× more nodes than the previous. Total leaves: 7^(log₂n) = n^2.807
Work distribution in the tree:
For standard multiplication: level ℓ contains 8^ℓ nodes, each doing Θ((n/2^ℓ)²) combining work, so the per-level total is n² × 2^ℓ and doubles at every level.
For Strassen: level ℓ contains 7^ℓ nodes, so the per-level total is n² × (7/4)^ℓ and grows by only a factor of 7/4 per level.
The key difference:
Standard: the bottom level dominates, contributing n² × 2^(log₂n) = n³ total work. Strassen: the bottom level contributes n² × (7/4)^(log₂n) = n² × n^(log₂(7/4)) ≈ n² × n^0.807 = n^2.807.
The recursion tree view makes the savings intuitive: fewer branches at each level means fewer total leaves, and leaves represent the elementary operations. Strassen's tree has n^2.807 leaves vs n³ for standard—a ratio that grows without bound.
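The sketch below prints the node counts level by level for an illustrative n = 1024 (k = 10), making the 8-way versus 7-way branching, and the resulting leaf counts, visible:

```python
# Subproblem counts per level of the two recursion trees, for n = 1024 (k = 10)
k = 10
for level in range(k + 1):
    size = 2 ** (k - level)
    print(f"level {level:>2}: {size:>4}×{size:<4} blocks | "
          f"standard nodes: {8**level:>13,} | Strassen nodes: {7**level:>13,}")

print(f"leaves: standard = 8^{k} = {8**k:,} = n³;  Strassen = 7^{k} = {7**k:,} ≈ n^2.807")
```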
The divide-and-conquer structure of Strassen's algorithm has significant implications for memory usage and cache performance.
Memory overhead:
Temporary matrices: Each level needs intermediate matrices for sums (e.g., A₁₁ + A₂₂) and products (M₁ through M₇)
Recursion stack: O(log n) levels of recursion, each with O(1) pointers
Total extra space: The intermediate matrices require O(n²) additional memory
Standard algorithm memory: O(n²) for output (no additional needed)
Strassen memory: O(n²) for output + O(n²) for intermediates = O(n²)
Asymptotically the same, but Strassen needs roughly 2-3× more memory in practice due to intermediate allocations.
Cache behavior:
Strassen's recursive structure has interesting cache properties:
Advantages: the recursion subdivides until subproblems fit entirely in cache, at which point they run with good locality and no explicit blocking is required.
Disadvantages: the temporary matrices for operand sums and the seven M products add memory traffic and allocations, which can evict useful data and hurt locality at the larger levels.
Cache-oblivious property:
Strassen is technically "cache-oblivious"—it adapts to any cache size without explicit tuning. When subproblems become small enough to fit in cache, they execute efficiently. This contrasts with explicit tiled algorithms that must be tuned for specific cache sizes.
In practice, naive Strassen implementations often lose to well-tuned iterative algorithms due to memory overhead and cache effects. High-performance implementations use Strassen at the top levels (for asymptotic benefit) but switch to optimized BLAS kernels at the base case (for cache efficiency).
A critical implementation decision is when to stop recursing and switch to standard multiplication. The choice significantly affects performance.
The trade-off: each level of recursion saves one multiplication out of eight, but pays for 18 block additions, temporary allocations, and call overhead; below some subproblem size the overhead outweighs the saved multiplication, so it's faster to stop recursing and multiply directly (as explored in the sketch below).
Factors affecting crossover: cache sizes, the quality of the base-case multiply (a tuned BLAS raises the crossover), SIMD and memory-bandwidth characteristics, allocation cost, and whether n is a power of two (otherwise padding adds work).
```python
import numpy as np
import time

def find_crossover():
    """
    Empirically find the crossover point between Strassen and standard.
    """
    def standard_multiply(A, B):
        return A @ B  # NumPy's BLAS-optimized multiplication

    def strassen(A, B, threshold):
        n = A.shape[0]
        if n <= threshold:
            return A @ B  # below the threshold, fall back to standard multiplication

        mid = n // 2
        A11, A12 = A[:mid, :mid], A[:mid, mid:]
        A21, A22 = A[mid:, :mid], A[mid:, mid:]
        B11, B12 = B[:mid, :mid], B[:mid, mid:]
        B21, B22 = B[mid:, :mid], B[mid:, mid:]

        # The 7 recursive products
        M1 = strassen(A11 + A22, B11 + B22, threshold)
        M2 = strassen(A21 + A22, B11, threshold)
        M3 = strassen(A11, B12 - B22, threshold)
        M4 = strassen(A22, B21 - B11, threshold)
        M5 = strassen(A11 + A12, B22, threshold)
        M6 = strassen(A21 - A11, B11 + B12, threshold)
        M7 = strassen(A12 - A22, B21 + B22, threshold)

        # Combine into the four output blocks
        C = np.empty((n, n))
        C[:mid, :mid] = M1 + M4 - M5 + M7
        C[:mid, mid:] = M3 + M5
        C[mid:, :mid] = M2 + M4
        C[mid:, mid:] = M1 - M2 + M3 + M6
        return C

    thresholds = [16, 32, 64, 128, 256, 512]
    n = 2048
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    print(f"Testing n={n}")
    print(f"{'Threshold':>12} | {'Time (seconds)':>15}")
    print("-" * 32)
    for t in thresholds:
        start = time.time()
        C = strassen(A, B, t)
        elapsed = time.time() - start
        print(f"{t:>12} | {elapsed:>15.4f}")

find_crossover()

# Typical output might show optimal threshold around 64-128
```

Typical crossover points:
| Environment | Typical Crossover |
|---|---|
| Naive Python | n ≈ 512-1024 |
| NumPy/BLAS | n ≈ 256-512 |
| Optimized C | n ≈ 64-256 |
| GPU (theoretical) | n ≈ 1024+ |
Note: When competing against highly optimized BLAS implementations, Strassen often needs larger n to show benefit because BLAS exploits cache so effectively.
Production implementations (like Intel MKL, OpenBLAS) use a hybrid approach: Strassen for the first few recursive levels (to get asymptotic benefit), then switch to highly optimized BLAS kernels (to get cache/SIMD benefit). The threshold is often dynamically chosen based on detected hardware.
We've dissected exactly how Strassen's divide-and-conquer approach achieves subcubic matrix multiplication through reducing multiplications.
What's next:
Having fully understood how Strassen achieves O(n^2.807), we'll examine practical considerations: when to use Strassen vs standard algorithms, implementation challenges, numerical stability, and the state of matrix multiplication in modern computing systems.
You now deeply understand the mechanics of how Strassen reduces multiplications through divide-and-conquer: the algebraic structure, the cancellation patterns, the recursive amplification, and why 7 is the optimal number for 2×2 blocks. This mechanical understanding is essential for appreciating both the elegance and limitations of the algorithm.