Matrix formulation isn't merely a notational convenience; it's the key to understanding and implementing neural networks efficiently. Every modern deep learning framework represents neural network computation as matrix operations, and this representation enables batched processing, efficient use of optimized linear algebra (BLAS) and hardware, and compact analysis of gradients.
This page develops the complete matrix formulation for MLPs, from single-sample to batch computation, with attention to the conventions and indexing that make implementation straightforward.
By the end of this page, you will understand: (1) The complete matrix notation for MLP computation; (2) How batch operations are formulated as matrix-matrix products; (3) The Jacobian structure for gradient analysis; (4) Dimensionality conventions and their implications; (5) Implementation patterns for vectorized code.
Precise notation prevents confusion when implementing neural networks. Different sources use different conventions; we establish ours clearly.
Scalar Notation: individual entries carry unit and layer indices, e.g. $z_j^{(l)}$, $a_j^{(l)}$, $W_{ji}^{(l)}$, $b_j^{(l)}$.
Vector Notation (single sample): bold lowercase for per-layer vectors, e.g. $\mathbf{x}$, $\mathbf{z}^{(l)}$, $\mathbf{a}^{(l)}$, $\mathbf{b}^{(l)}$.
Matrix Notation: uppercase for weight matrices, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$.
Batch Notation: uppercase for samples stacked as rows, e.g. $X \in \mathbb{R}^{B \times n_0}$ and $Z^{(l)}, A^{(l)} \in \mathbb{R}^{B \times n_l}$.
| Object | Shape (Single) | Shape (Batch B) | Description |
|---|---|---|---|
| Input | (n₀,) | (B, n₀) | Raw features |
| Weights W⁽ˡ⁾ | (nₗ, nₗ₋₁) | (nₗ, nₗ₋₁) | Layer l weights |
| Biases b⁽ˡ⁾ | (nₗ,) | (nₗ,) | Layer l biases |
| Pre-activation z⁽ˡ⁾ | (nₗ,) | (B, nₗ) | Before activation |
| Activation a⁽ˡ⁾ | (nₗ,) | (B, nₗ) | After activation |
| Output | (n_L,) | (B, n_L) | Final prediction |
Some texts use different conventions: samples as columns ($X \in \mathbb{R}^{n_0 \times B}$), or weights stored as $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$ so that $Z = A W$ needs no transpose. Our convention (samples as rows, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$) matches common framework implementations and makes broadcasting more intuitive. Always verify the convention when reading papers or adapting code.
For a single input sample, forward propagation is a sequence of matrix-vector operations.
Layer-wise Computation:
Given input $\mathbf{a}^{(0)} = \mathbf{x}$, for each layer $l = 1, \ldots, L$:
$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$$
Component Form:
For unit $j$ in layer $l$:
$$z_j^{(l)} = \sum_{i=1}^{n_{l-1}} W_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)}$$
$$a_j^{(l)} = \sigma^{(l)}(z_j^{(l)})$$
The matrix form compactly represents all $n_l$ of these computations simultaneously.
Complete Network Function:
$$f(\mathbf{x}; \Theta) = (\sigma^{(L)} \circ g^{(L)} \circ \sigma^{(L-1)} \circ g^{(L-1)} \circ \cdots \circ \sigma^{(1)} \circ g^{(1)})(\mathbf{x})$$
where $g^{(l)}(\mathbf{a}) = W^{(l)}\mathbf{a} + \mathbf{b}^{(l)}$ is the affine transformation.
Dimensional Analysis:
Consider a network $(784, 256, 128, 10)$:
Layer 1: $\mathbf{z}^{(1)} = \underbrace{W^{(1)}}_{256 \times 784} \underbrace{\mathbf{x}}_{784 \times 1} + \underbrace{\mathbf{b}^{(1)}}_{256 \times 1} \Rightarrow \underbrace{\mathbf{z}^{(1)}}_{256 \times 1}$
Layer 2: $\mathbf{z}^{(2)} = \underbrace{W^{(2)}}_{128 \times 256} \underbrace{\mathbf{a}^{(1)}}_{256 \times 1} + \underbrace{\mathbf{b}^{(2)}}_{128 \times 1} \Rightarrow \underbrace{\mathbf{z}^{(2)}}_{128 \times 1}$
Layer 3: $\mathbf{z}^{(3)} = \underbrace{W^{(3)}}_{10 \times 128} \underbrace{\mathbf{a}^{(2)}}_{128 \times 1} + \underbrace{\mathbf{b}^{(3)}}_{10 \times 1} \Rightarrow \underbrace{\mathbf{z}^{(3)}}_{10 \times 1}$
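The same shape chain can be checked quickly with random arrays; this is a minimal sketch using the layer sizes from the example above and ReLU as a stand-in activation:

```python
import numpy as np

# Architecture (784, 256, 128, 10) from the example above
sizes = [784, 256, 128, 10]
weights = [np.random.randn(sizes[l+1], sizes[l]) for l in range(len(sizes) - 1)]
biases = [np.random.randn(sizes[l+1]) for l in range(len(sizes) - 1)]

a = np.random.randn(784)  # a^(0) = x
for l, (W, b) in enumerate(zip(weights, biases), start=1):
    z = W @ a + b           # (n_l, n_{l-1}) @ (n_{l-1},) + (n_l,)
    a = np.maximum(0, z)    # ReLU activation
    print(f"Layer {l}: W {W.shape} -> z {z.shape}")
# Prints (256,), (128,), (10,) for the three layers
```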
```python
import numpy as np
from typing import List


def single_sample_forward(
    x: np.ndarray,               # Shape: (n_input,)
    weights: List[np.ndarray],   # Each W: (n_out, n_in)
    biases: List[np.ndarray],    # Each b: (n_out,)
    activations: List[callable],
) -> np.ndarray:
    """
    Forward propagation for a single sample using matrix-vector operations.

    This implementation follows the mathematical notation exactly,
    explicitly showing the matrix-vector multiplications.

    Args:
        x: Input vector of shape (n_input,)
        weights: List of weight matrices W^(l), each shape (n_out, n_in)
        biases: List of bias vectors b^(l), each shape (n_out,)
        activations: List of activation functions

    Returns:
        output: Network output vector
    """
    a = x  # a^(0) = x

    for l, (W, b, sigma) in enumerate(zip(weights, biases, activations)):
        a_prev = a

        # z^(l) = W^(l) @ a^(l-1) + b^(l)
        # Matrix-vector multiplication: (n_l, n_{l-1}) @ (n_{l-1},) -> (n_l,)
        z = W @ a_prev + b

        # a^(l) = σ^(l)(z^(l))
        a = sigma(z)

        # Debug output (optional)
        print(f"Layer {l+1}: W{W.shape} @ a{a_prev.shape} + b{b.shape} "
              f"-> z{z.shape} -> a{a.shape}")

    return a


def verify_component_equivalence():
    """
    Verify that matrix formulation equals component-wise computation.

    This demonstrates the equivalence of the two perspectives.
    """
    np.random.seed(42)

    # Simple network: 3 -> 4 -> 2
    W1 = np.random.randn(4, 3)
    b1 = np.random.randn(4)
    W2 = np.random.randn(2, 4)
    b2 = np.random.randn(2)
    x = np.random.randn(3)

    relu = lambda z: np.maximum(0, z)

    # Matrix form
    z1_matrix = W1 @ x + b1
    a1_matrix = relu(z1_matrix)
    z2_matrix = W2 @ a1_matrix + b2
    a2_matrix = relu(z2_matrix)

    # Component form
    z1_component = np.zeros(4)
    for j in range(4):
        z1_component[j] = sum(W1[j, i] * x[i] for i in range(3)) + b1[j]
    a1_component = relu(z1_component)

    z2_component = np.zeros(2)
    for j in range(2):
        z2_component[j] = sum(W2[j, i] * a1_component[i] for i in range(4)) + b2[j]
    a2_component = relu(z2_component)

    # Compare
    print("Matrix vs Component Equivalence")
    print("=" * 40)
    print(f"z1 max diff: {np.max(np.abs(z1_matrix - z1_component)):.2e}")
    print(f"a1 max diff: {np.max(np.abs(a1_matrix - a1_component)):.2e}")
    print(f"z2 max diff: {np.max(np.abs(z2_matrix - z2_component)):.2e}")
    print(f"a2 max diff: {np.max(np.abs(a2_matrix - a2_component)):.2e}")
    print("\n✓ Matrix and component forms are mathematically equivalent")


if __name__ == "__main__":
    verify_component_equivalence()
```

Batch computation extends the single-sample formulation to process $B$ samples simultaneously using matrix-matrix operations.
Batch Representation:
Stack $B$ input samples as rows:
$$X = \begin{bmatrix} (\mathbf{x}^{(1)})^\top \\ (\mathbf{x}^{(2)})^\top \\ \vdots \\ (\mathbf{x}^{(B)})^\top \end{bmatrix} \in \mathbb{R}^{B \times n_0}$$
Batch Forward Computation:
$$Z^{(l)} = A^{(l-1)} W^{(l)\top} + \mathbf{1}_B \mathbf{b}^{(l)\top}$$
$$A^{(l)} = \sigma^{(l)}(Z^{(l)})$$
With broadcasting (as implemented in NumPy/PyTorch):
$$Z^{(l)} = A^{(l-1)} W^{(l)\top} + \mathbf{b}^{(l)}$$
The bias vector $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$ broadcasts to all $B$ rows.
Dimensional Analysis:
Consider network $(784, 256, 10)$ with batch $B = 32$:
Layer 1: $$Z^{(1)} = \underbrace{A^{(0)}}_{32 \times 784} \underbrace{W^{(1)\top}}_{784 \times 256} + \underbrace{\mathbf{b}^{(1)}}_{256} \Rightarrow \underbrace{Z^{(1)}}_{32 \times 256}$$
Layer 2: $$Z^{(2)} = \underbrace{A^{(1)}}_{32 \times 256} \underbrace{W^{(2)\top}}_{256 \times 10} + \underbrace{\mathbf{b}^{(2)}}_{10} \Rightarrow \underbrace{Z^{(2)}}_{32 \times 10}$$
Alternative Convention (Samples as Columns):
Some texts use: $$Z^{(l)} = W^{(l)} A^{(l-1)} + \mathbf{b}^{(l)} \mathbf{1}_B^\top$$
where $A^{(l)} \in \mathbb{R}^{n_l \times B}$ has samples as columns. This avoids the transpose on $W$ but changes broadcasting semantics.
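To make the difference concrete, here is a small sketch (random data, arbitrary sizes) confirming that both conventions compute the same numbers, just laid out as rows versus columns:

```python
import numpy as np

np.random.seed(0)
B, n_in, n_out = 4, 3, 5
W = np.random.randn(n_out, n_in)   # W^(l): (n_l, n_{l-1})
b = np.random.randn(n_out)
X_rows = np.random.randn(B, n_in)  # samples as rows:    (B, n_0)
X_cols = X_rows.T                  # samples as columns: (n_0, B)

Z_rows = X_rows @ W.T + b          # (B, n_out); b broadcasts over rows
Z_cols = W @ X_cols + b[:, None]   # (n_out, B); b broadcasts over columns

# Same values, transposed layout
print(np.allclose(Z_rows, Z_cols.T))  # True
```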
```python
import numpy as np
from typing import List, Tuple


def batch_forward_explicit(
    X: np.ndarray,               # Shape: (B, n_input)
    weights: List[np.ndarray],   # Each W: (n_out, n_in)
    biases: List[np.ndarray],    # Each b: (n_out,)
    activations: List[callable],
) -> Tuple[np.ndarray, List[np.ndarray]]:
    """
    Batch forward propagation with explicit matrix operations.

    This implementation clearly shows the matrix-matrix multiplications
    and broadcasting for batch processing.

    Args:
        X: Input batch of shape (B, n_input)
        weights: List of weight matrices W^(l), each shape (n_out, n_in)
        biases: List of bias vectors b^(l), each shape (n_out,)
        activations: List of activation functions

    Returns:
        output: Network output of shape (B, n_output)
        all_activations: All layer activations for backprop
    """
    B = X.shape[0]
    A = X  # A^(0) = X, shape (B, n_0)
    all_activations = [A]

    for l, (W, b, sigma) in enumerate(zip(weights, biases, activations)):
        # Z^(l) = A^(l-1) @ W^(l).T + b^(l)
        # Shape: (B, n_{l-1}) @ (n_{l-1}, n_l) + (n_l,) -> (B, n_l)
        # The bias b broadcasts from (n_l,) to (B, n_l)
        Z = A @ W.T + b

        # A^(l) = σ^(l)(Z^(l))
        A = sigma(Z)
        all_activations.append(A)

        print(f"Layer {l+1}:")
        print(f"  A^({l}) @ W^({l+1}).T + b^({l+1})")
        print(f"  ({B}×{W.shape[1]}) @ ({W.shape[1]}×{W.shape[0]}) + ({b.shape[0]},)")
        print(f"  -> Z: {Z.shape}, A: {A.shape}")

    return A, all_activations


def demonstrate_broadcasting():
    """
    Explain and demonstrate NumPy/PyTorch broadcasting in batch computation.
    """
    print("Broadcasting in Batch Computation")
    print("=" * 50)

    B = 4  # Batch size
    n_in = 3
    n_out = 5

    # Batch of activations
    A = np.random.randn(B, n_in)
    print(f"Input A: shape {A.shape}")
    print(A)

    # Weight matrix
    W = np.random.randn(n_out, n_in)
    print(f"\nWeight W: shape {W.shape}")

    # Bias vector
    b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    print(f"Bias b: shape {b.shape}")
    print(f"  b = {b}")

    # Matrix multiplication
    Z_no_bias = A @ W.T
    print(f"\nA @ W.T: shape {Z_no_bias.shape}")

    # Broadcasting: b is added to each row
    Z = Z_no_bias + b
    print(f"(A @ W.T) + b: shape {Z.shape}")

    print("\nBroadcasting adds b to each row:")
    for i in range(B):
        print(f"  Row {i}: Z_no_bias[{i}] + b = {Z_no_bias[i]} + {b}")
        print(f"           = {Z[i]}")

    # Verify equivalence to explicit addition
    Z_explicit = np.zeros((B, n_out))
    for i in range(B):
        Z_explicit[i] = Z_no_bias[i] + b
    print(f"\nMax diff (broadcast vs explicit): {np.max(np.abs(Z - Z_explicit))}")


def flops_analysis():
    """
    Analyze FLOPs for batch matrix operations.
    """
    print("\nFLOPs Analysis for Batch Computation")
    print("=" * 50)

    # Network: 784 -> 512 -> 256 -> 10
    architecture = [784, 512, 256, 10]
    batch_size = 32

    total_flops = 0
    for l in range(len(architecture) - 1):
        n_in = architecture[l]
        n_out = architecture[l + 1]

        # Matrix multiply: (B, n_in) @ (n_in, n_out)
        # FLOPs: 2 * B * n_out * n_in (multiply-add)
        mm_flops = 2 * batch_size * n_out * n_in

        # Bias addition: B * n_out
        bias_flops = batch_size * n_out

        # Activation (ReLU): B * n_out comparisons
        act_flops = batch_size * n_out

        layer_flops = mm_flops + bias_flops + act_flops
        total_flops += layer_flops

        print(f"Layer {l+1} ({n_in}→{n_out}):")
        print(f"  MatMul: {mm_flops:,} FLOPs ({100*mm_flops/layer_flops:.1f}%)")
        print(f"  Bias:   {bias_flops:,} FLOPs")
        print(f"  Act:    {act_flops:,} FLOPs")
        print(f"  Total:  {layer_flops:,} FLOPs")

    print(f"\nTotal network: {total_flops:,} FLOPs")
    print(f"Per sample: {total_flops // batch_size:,} FLOPs")


if __name__ == "__main__":
    demonstrate_broadcasting()
    flops_analysis()
```

Using Z = A @ W.T (with samples as rows) means we store W with output neurons as rows, the same orientation used in the single-sample formula z = Wx. This consistency simplifies code. PyTorch's nn.Linear uses this convention: its weight has shape (out_features, in_features), matching our $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$.
Understanding the Jacobian (matrix of partial derivatives) of network operations is essential for analyzing gradient flow and understanding backpropagation.
Jacobian Definition:
For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J_f \in \mathbb{R}^{m \times n}$ has entries:
$$(J_f)_{ij} = \frac{\partial f_i}{\partial x_j}$$
Jacobian of Affine Transformation:
For $g(\mathbf{a}) = W\mathbf{a} + \mathbf{b}$:
$$J_g = \frac{\partial (W\mathbf{a} + \mathbf{b})}{\partial \mathbf{a}} = W$$
The Jacobian of a linear transformation is simply the weight matrix itself.
Jacobian of Element-wise Activation:
For $\sigma(\mathbf{z})$ applied element-wise:
$$J_\sigma = \text{diag}(\sigma'(z_1), \sigma'(z_2), \ldots, \sigma'(z_n))$$
A diagonal matrix with activation derivatives on the diagonal.
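Both Jacobians can be verified numerically with finite differences; the sketch below uses arbitrary small shapes and tanh as an example element-wise activation:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += eps
        x_minus[j] -= eps
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return J

np.random.seed(0)
W = np.random.randn(3, 4)
b = np.random.randn(3)
a = np.random.randn(4)
z = np.random.randn(3)

# Affine map: Jacobian is W itself
J_affine = numerical_jacobian(lambda v: W @ v + b, a)
print(np.allclose(J_affine, W, atol=1e-5))                        # True

# Element-wise tanh: Jacobian is diag(1 - tanh(z)^2)
J_act = numerical_jacobian(np.tanh, z)
print(np.allclose(J_act, np.diag(1 - np.tanh(z)**2), atol=1e-5))  # True
```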
Chain Rule for Composed Functions:
For composition $h = f \circ g$:
$$J_h = J_f \cdot J_g$$
Jacobians multiply in the order of function application.
Full Forward Pass Jacobian:
For the network function $f(\mathbf{x}) = \mathbf{a}^{(L)}$, the Jacobian $\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}}$ is:
$$\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}} = \left(\prod_{l=L}^{1} D^{(l)} W^{(l)}\right)$$
where $D^{(l)} = \text{diag}(\sigma'^{(l)}(\mathbf{z}^{(l)}))$ is the diagonal matrix of activation derivatives at layer $l$.
Expanded:
$$\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}} = D^{(L)} W^{(L)} D^{(L-1)} W^{(L-1)} \cdots D^{(1)} W^{(1)}$$
Gradient of Loss w.r.t. Input:
If $\mathcal{L} = \ell(\mathbf{a}^{(L)})$ for some scalar loss $\ell$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \left(\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}}\right)^\top \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}}$$
This is exactly what backpropagation computes, layer by layer, without forming the full Jacobian.
```python
import numpy as np
from typing import List, Tuple


def relu(z):
    return np.maximum(0, z)


def relu_derivative(z):
    """Derivative of ReLU: 1 if z > 0, else 0."""
    return (z > 0).astype(float)


def compute_layer_jacobian(
    W: np.ndarray,
    z: np.ndarray,
    activation_derivative: callable,
) -> np.ndarray:
    """
    Compute the Jacobian of a single layer: J = D @ W

    Args:
        W: Weight matrix, shape (n_out, n_in)
        z: Pre-activation vector, shape (n_out,)
        activation_derivative: σ'(z) function

    Returns:
        Jacobian of shape (n_out, n_in)
    """
    # D is diagonal matrix of activation derivatives
    deriv = activation_derivative(z)  # Shape (n_out,)
    D = np.diag(deriv)                # Shape (n_out, n_out)

    # J = D @ W
    J = D @ W
    return J


def compute_full_jacobian(
    x: np.ndarray,
    weights: List[np.ndarray],
    biases: List[np.ndarray],
    activation_derivative: callable,
) -> Tuple[np.ndarray, List[np.ndarray]]:
    """
    Compute full network Jacobian ∂a^(L)/∂x.

    This explicitly constructs the Jacobian for demonstration.
    In practice, we use backpropagation instead.

    Returns:
        full_jacobian: ∂a^(L)/∂x of shape (n_L, n_0)
        layer_jacobians: List of per-layer Jacobians
    """
    # Forward pass to get z values
    a = x
    z_values = []
    for W, b in zip(weights, biases):
        z = W @ a + b
        z_values.append(z)
        a = relu(z)

    # Compute layer Jacobians
    layer_jacobians = []
    for l in range(len(weights)):
        J_l = compute_layer_jacobian(weights[l], z_values[l], activation_derivative)
        layer_jacobians.append(J_l)

    # Chain together: J = J_L @ J_{L-1} @ ... @ J_1
    full_jacobian = layer_jacobians[-1]
    for l in range(len(weights) - 2, -1, -1):
        full_jacobian = full_jacobian @ layer_jacobians[l]

    return full_jacobian, layer_jacobians


def verify_jacobian_numerically(
    x: np.ndarray,
    weights: List[np.ndarray],
    biases: List[np.ndarray],
    eps: float = 1e-5,
) -> np.ndarray:
    """
    Compute Jacobian numerically via finite differences.

    Used to verify analytical Jacobian computation.
    """
    def forward(x):
        a = x
        for W, b in zip(weights, biases):
            z = W @ a + b
            a = relu(z)
        return a

    n_in = x.shape[0]
    n_out = forward(x).shape[0]
    J_numerical = np.zeros((n_out, n_in))

    for i in range(n_in):
        x_plus = x.copy()
        x_plus[i] += eps
        x_minus = x.copy()
        x_minus[i] -= eps
        J_numerical[:, i] = (forward(x_plus) - forward(x_minus)) / (2 * eps)

    return J_numerical


def analyze_jacobian_properties():
    """
    Analyze Jacobian to understand gradient flow.
    """
    np.random.seed(42)

    # Simple network: 4 -> 8 -> 6 -> 3
    weights = [
        np.random.randn(8, 4) * 0.5,
        np.random.randn(6, 8) * 0.5,
        np.random.randn(3, 6) * 0.5,
    ]
    biases = [np.zeros(8), np.zeros(6), np.zeros(3)]
    x = np.random.randn(4)

    # Compute analytical Jacobian
    J_analytical, layer_Js = compute_full_jacobian(
        x, weights, biases, relu_derivative
    )

    # Compute numerical Jacobian
    J_numerical = verify_jacobian_numerically(x, weights, biases)

    print("Jacobian Analysis")
    print("=" * 50)
    print(f"Network: 4 → 8 → 6 → 3")
    print(f"Full Jacobian shape: {J_analytical.shape}")

    print("\nLayer Jacobians:")
    for l, J in enumerate(layer_Js):
        print(f"  Layer {l+1}: shape {J.shape}, "
              f"rank={np.linalg.matrix_rank(J)}, "
              f"norm={np.linalg.norm(J):.4f}")

    # Verify correctness
    max_diff = np.max(np.abs(J_analytical - J_numerical))
    print(f"\nMax diff (analytical vs numerical): {max_diff:.2e}")

    # Analyze singular values for gradient flow
    U, S, Vh = np.linalg.svd(J_analytical)
    print(f"\nFull Jacobian singular values: {S}")
    print(f"Condition number: {S.max() / S.min():.2f}")
    print("(Large condition number → gradient magnification/shrinkage)")


if __name__ == "__main__":
    analyze_jacobian_properties()
```

Explicitly constructing the full Jacobian requires O(n_L × n_0) space and is expensive to compute; for realistic networks, this is infeasible. Backpropagation instead computes the product Jᵀv (a vector-Jacobian product) efficiently in O(parameters) time, without ever forming J explicitly. This is why we study Jacobian structure for analysis but use backpropagation for computation.
Efficient neural network implementation leverages decades of linear algebra optimization. Understanding these optimizations explains performance characteristics.
BLAS Levels:
BLAS (Basic Linear Algebra Subprograms) defines three levels of operations:
| Level | Operation | FLOPs | Memory | Arithmetic Intensity |
|---|---|---|---|---|
| 1 | Vector-vector (dot product) | O(n) | O(n) | O(1) |
| 2 | Matrix-vector | O(n²) | O(n²) | O(1) |
| 3 | Matrix-matrix | O(n³) | O(n²) | O(n) |
Arithmetic intensity = FLOPs / Memory accesses
Higher intensity = better hardware utilization.
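A back-of-the-envelope comparison for n = 1024 with float32 operands (counting a multiply-add as 2 FLOPs) makes the last column of the table concrete:

```python
n = 1024
bytes_per_float = 4

# Level-2: y = W @ x, with W (n, n) and x (n,)
flops_l2 = 2 * n * n
mem_l2 = (n * n + 2 * n) * bytes_per_float   # read W and x, write y
print(f"Level-2 intensity: {flops_l2 / mem_l2:.2f} FLOPs/byte")  # ~0.5

# Level-3: C = A @ B, all matrices (n, n)
flops_l3 = 2 * n ** 3
mem_l3 = 3 * n * n * bytes_per_float         # read A and B, write C
print(f"Level-3 intensity: {flops_l3 / mem_l3:.1f} FLOPs/byte")  # ~170
```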
Why Batch Processing Wins:
Single-sample: Level-2 BLAS (matrix-vector). Batch processing: Level-3 BLAS (matrix-matrix).
Level-3 operations can reuse data in fast cache, reducing memory bandwidth bottleneck.
The Memory Hierarchy:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| Registers | KB | 0 cycles | N/A |
| L1 Cache | 32-64 KB | 4 cycles | ~1 TB/s |
| L2 Cache | 256 KB | ~12 cycles | ~500 GB/s |
| L3 Cache | 8-32 MB | ~40 cycles | ~200 GB/s |
| Main Memory | GB | ~200 cycles | ~50 GB/s |
| GPU Global Memory | GB | ~400 cycles | ~900 GB/s |
Matrix-matrix multiplication algorithms (like Strassen, or tiled implementations) exploit cache hierarchy to minimize slow memory access.
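The idea behind tiled (blocked) multiplication can be sketched in a few lines: compute the product block by block so each tile stays in cache while it is reused. This is only an illustration; production BLAS kernels choose tile sizes to match cache levels and use vectorized microkernels.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    """Blocked matrix multiply: C = A @ B computed tile by tile."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Each small block fits in cache and is reused across inner products
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.randn(256, 256)
B = np.random.randn(256, 256)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```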
```python
import numpy as np
import time


def benchmark_matrix_operations():
    """
    Compare performance of different operation patterns.

    Demonstrates why batching and matrix ops are preferred.
    """
    np.random.seed(42)

    n = 512
    k = 256
    B = 128  # Batch size

    # Matrices
    A = np.random.randn(B, n).astype(np.float32)  # Batch of vectors
    W = np.random.randn(k, n).astype(np.float32)  # Weight matrix
    b = np.random.randn(k).astype(np.float32)     # Bias

    n_trials = 100

    print("Matrix Operation Benchmarks")
    print("=" * 50)
    print(f"Batch size: {B}, Input dim: {n}, Output dim: {k}")
    print("-" * 50)

    # Method 1: Loop over samples (Level-2 BLAS)
    start = time.time()
    for _ in range(n_trials):
        results = np.zeros((B, k), dtype=np.float32)
        for i in range(B):
            results[i] = W @ A[i] + b
    loop_time = time.time() - start
    print(f"Loop (Level-2):  {1000*loop_time/n_trials:.3f} ms/batch")

    # Method 2: Batch matrix multiply (Level-3 BLAS)
    start = time.time()
    for _ in range(n_trials):
        results = A @ W.T + b
    batch_time = time.time() - start
    print(f"Batch (Level-3): {1000*batch_time/n_trials:.3f} ms/batch")

    speedup = loop_time / batch_time
    print(f"\nSpeedup: {speedup:.1f}x")

    # Method 3: Show FLOP analysis
    flops_per_batch = 2 * B * k * n + B * k  # matmul + bias
    gflops_batch = (flops_per_batch * n_trials) / (batch_time * 1e9)
    print(f"\nBatch throughput: {gflops_batch:.2f} GFLOPS")


def memory_access_analysis():
    """
    Analyze memory access patterns in matrix operations.
    """
    print("\nMemory Access Analysis")
    print("=" * 50)

    # For Z = A @ W.T + b
    # Where A: (B, n), W: (k, n), b: (k,)
    B, n, k = 128, 512, 256

    print(f"Operation: Z = A @ W.T + b")
    print(f"A: ({B}, {n}), W: ({k}, {n}), b: ({k},)")
    print()

    # Memory reads
    A_bytes = B * n * 4  # float32
    W_bytes = k * n * 4
    b_bytes = k * 4
    total_read = A_bytes + W_bytes + b_bytes

    # Memory writes
    Z_bytes = B * k * 4

    # FLOPs
    flops = 2 * B * k * n + B * k  # matmul + bias

    # Arithmetic intensity
    intensity = flops / (total_read + Z_bytes)

    print(f"Memory reads:  {total_read / 1024:.1f} KB")
    print(f"  A: {A_bytes / 1024:.1f} KB")
    print(f"  W: {W_bytes / 1024:.1f} KB")
    print(f"  b: {b_bytes / 1024:.2f} KB")
    print(f"Memory writes: {Z_bytes / 1024:.1f} KB")
    print(f"Total memory:  {(total_read + Z_bytes) / 1024:.1f} KB")
    print(f"\nFLOPs: {flops:,}")
    print(f"Arithmetic Intensity: {intensity:.1f} FLOPs/byte")

    # Compare with single sample (B=1)
    B_single = 1
    A_bytes_s = B_single * n * 4
    Z_bytes_s = B_single * k * 4
    flops_s = 2 * B_single * k * n + B_single * k
    total_mem_s = A_bytes_s + W_bytes + b_bytes + Z_bytes_s
    intensity_s = flops_s / total_mem_s

    print(f"\nSingle sample intensity: {intensity_s:.2f} FLOPs/byte")
    print(f"Batch intensity improvement: {intensity / intensity_s:.1f}x")


if __name__ == "__main__":
    benchmark_matrix_operations()
    memory_access_analysis()
```

The roofline model visualizes the relationship between arithmetic intensity and achievable performance. Operations with low intensity (like single-sample inference) are memory-bound: performance is limited by memory bandwidth. High-intensity operations (large batch matrix multiplies) are compute-bound: limited by processor speed. Understanding this helps choose batch sizes and identify bottlenecks.
Translating matrix formulations into efficient code requires attention to several practical concerns.
Consistent Shape Handling:
- Keep a batch dimension throughout; use reshape or unsqueeze to add a batch dimension when passing a single sample.

In-Place Operations:
- torch.relu_(x) applies ReLU in place, avoiding an extra tensor allocation; use in-place ops only where the original tensor is not needed elsewhere.

Fused Operations:
- F.linear(x, W, b): matmul + bias in one operation
- F.relu(x, inplace=True): saves memory

Data Types:
- float32 is the usual default; lower-precision types reduce memory and bandwidth (a short autocast sketch follows this list).
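As a sketch of the data-type point, PyTorch's autocast context runs eligible operations (such as linear layers) in a lower-precision type inside the block; bfloat16 on CPU is shown here only because it runs without a GPU, and actual speedups depend on hardware support:

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 256)
x = torch.randn(32, 784)

# Default: float32 end to end
print(layer(x).dtype)  # torch.float32

# Mixed precision: matmuls run in bfloat16 inside the autocast region
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = layer(x)
print(y.dtype)  # torch.bfloat16
```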
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EfficientMLP(nn.Module):
    """
    MLP implementation following best practices for efficiency.

    Key practices:
    - Fused operations where possible
    - Proper initialization
    - Optional mixed precision support
    - Efficient forward pass implementation
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: list,
        out_features: int,
        activation: str = "relu",
        use_bias: bool = True,
    ):
        super().__init__()

        all_sizes = [in_features] + hidden_features + [out_features]

        # Build layers
        self.layers = nn.ModuleList()
        for i in range(len(all_sizes) - 1):
            self.layers.append(
                nn.Linear(all_sizes[i], all_sizes[i+1], bias=use_bias)
            )

        # Store activation choice
        self.activation = activation
        self.num_hidden = len(hidden_features)

        # Initialize weights
        self._init_weights()

    def _init_weights(self):
        """Apply He initialization for ReLU networks."""
        for module in self.layers:
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

    def _get_activation(self, x: torch.Tensor, inplace: bool = True) -> torch.Tensor:
        """Apply activation function efficiently."""
        if self.activation == "relu":
            return F.relu(x, inplace=inplace)
        elif self.activation == "gelu":
            return F.gelu(x)
        elif self.activation == "silu":
            return F.silu(x, inplace=inplace)
        else:
            raise ValueError(f"Unknown activation: {self.activation}")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Efficient forward pass.

        Uses in-place operations for activations where safe.
        """
        # Ensure 2D input
        if x.dim() == 1:
            x = x.unsqueeze(0)

        # Hidden layers with activation
        for layer in self.layers[:-1]:
            # F.linear is a fused matmul + bias
            x = self._get_activation(layer(x), inplace=True)

        # Output layer (no activation)
        x = self.layers[-1](x)
        return x


class NumericallyStableOutput(nn.Module):
    """
    Numerically stable output layer for classification.

    Combines log_softmax with the loss for stable training.
    For inference, can use softmax normally.
    """

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor, return_logits: bool = True) -> torch.Tensor:
        """
        Forward pass returning logits or log-probabilities.

        Args:
            x: Input features
            return_logits: If True, return raw logits (for use with CrossEntropyLoss)
                           If False, return log_softmax (for NLLLoss or analysis)
        """
        logits = self.linear(x)

        if return_logits:
            # CrossEntropyLoss combines log_softmax + nll internally
            return logits
        else:
            # Numerically stable log-softmax
            return F.log_softmax(logits, dim=-1)

    def predict_proba(self, x: torch.Tensor) -> torch.Tensor:
        """Return probabilities for inference."""
        with torch.no_grad():
            logits = self.linear(x)
            return F.softmax(logits, dim=-1)


# Demonstration
def demonstrate_efficient_implementation():
    """Show efficient MLP in action."""
    # Create model
    model = EfficientMLP(
        in_features=784,
        hidden_features=[512, 256, 128],
        out_features=10,
        activation="relu"
    )

    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Model parameters: {total_params:,}")

    # Test forward pass
    batch = torch.randn(64, 784)

    # Warm up (compilation, caching)
    _ = model(batch)

    # Benchmark
    import time
    n_iterations = 100
    start = time.time()
    for _ in range(n_iterations):
        output = model(batch)
    elapsed = time.time() - start

    print(f"Forward pass: {1000 * elapsed / n_iterations:.3f} ms/batch")
    print(f"Throughput: {64 * n_iterations / elapsed:.0f} samples/sec")


if __name__ == "__main__":
    demonstrate_efficient_implementation()
```

Matrix formulation transforms neural network computation from abstract algebra into implementable, efficient operations. This foundation enables both theoretical analysis and practical implementation.
What's Next:
With the computational framework established, we turn to the final core component: Activation Functions. These nonlinear functions are what make neural networks more than stacked linear transformations. We'll examine their mathematical properties, gradient behavior, and selection guidelines for different architectures.
You now command the matrix formulation of neural network computation. This understanding directly maps to code: when you write output = X @ W.T + b, you know exactly what mathematical operation is performed, how it relates to the network's function class, and why this representation enables efficient computation. This foundation makes backpropagation, which we'll cover soon, much more intuitive.