Matrix formulation isn't merely a notational convenience; it's the key to understanding and implementing neural networks efficiently. Every modern deep learning framework represents neural network computation as matrix operations, and this representation enables batched processing, efficient use of optimized linear algebra (BLAS) and hardware, and compact analysis of gradients.
This page develops the complete matrix formulation for MLPs, from single-sample to batch computation, with attention to the conventions and indexing that make implementation straightforward.
By the end of this page, you will understand: (1) The complete matrix notation for MLP computation; (2) How batch operations are formulated as matrix-matrix products; (3) The Jacobian structure for gradient analysis; (4) Dimensionality conventions and their implications; (5) Implementation patterns for vectorized code.
Precise notation prevents confusion when implementing neural networks. Different sources use different conventions; we establish ours clearly.
Scalar Notation: individual entries carry unit and layer indices, e.g. $z_j^{(l)}$, $a_j^{(l)}$, $W_{ji}^{(l)}$, $b_j^{(l)}$.
Vector Notation (single sample): bold lowercase for per-layer vectors, e.g. $\mathbf{x}$, $\mathbf{z}^{(l)}$, $\mathbf{a}^{(l)}$, $\mathbf{b}^{(l)}$.
Matrix Notation: uppercase for weight matrices, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$.
Batch Notation: uppercase for samples stacked as rows, e.g. $X \in \mathbb{R}^{B \times n_0}$ and $Z^{(l)}, A^{(l)} \in \mathbb{R}^{B \times n_l}$.
| Object | Shape (Single) | Shape (Batch B) | Description |
|---|---|---|---|
| Input | (n₀,) | (B, n₀) | Raw features |
| Weights W⁽ˡ⁾ | (nₗ, nₗ₋₁) | (nₗ, nₗ₋₁) | Layer l weights |
| Biases b⁽ˡ⁾ | (nₗ,) | (nₗ,) | Layer l biases |
| Pre-activation z⁽ˡ⁾ | (nₗ,) | (B, nₗ) | Before activation |
| Activation a⁽ˡ⁾ | (nₗ,) | (B, nₗ) | After activation |
| Output | (n_L,) | (B, n_L) | Final prediction |
Some texts use different conventions: samples as columns ($X \in \mathbb{R}^{n_0 \times B}$), or weights stored as $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$ so that $Z = A W$ needs no transpose. Our convention (samples as rows, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$) matches common framework implementations and makes broadcasting more intuitive. Always verify the convention when reading papers or adapting code.
For a single input sample, forward propagation is a sequence of matrix-vector operations.
Layer-wise Computation:
Given input $\mathbf{a}^{(0)} = \mathbf{x}$, for each layer $l = 1, \ldots, L$:
$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$$
Component Form:
For unit $j$ in layer $l$:
$$z_j^{(l)} = \sum_{i=1}^{n_{l-1}} W_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)}$$
$$a_j^{(l)} = \sigma^{(l)}(z_j^{(l)})$$
The matrix form compactly represents all $n_l$ of these computations simultaneously.
Complete Network Function:
$$f(\mathbf{x}; \Theta) = (\sigma^{(L)} \circ g^{(L)} \circ \sigma^{(L-1)} \circ g^{(L-1)} \circ \cdots \circ \sigma^{(1)} \circ g^{(1)})(\mathbf{x})$$
where $g^{(l)}(\mathbf{a}) = W^{(l)}\mathbf{a} + \mathbf{b}^{(l)}$ is the affine transformation.
Dimensional Analysis:
Consider a network $(784, 256, 128, 10)$:
Layer 1: $\mathbf{z}^{(1)} = \underbrace{W^{(1)}}_{256 \times 784} \underbrace{\mathbf{x}}_{784 \times 1} + \underbrace{\mathbf{b}^{(1)}}_{256 \times 1} \Rightarrow \underbrace{\mathbf{z}^{(1)}}_{256 \times 1}$
Layer 2: $\mathbf{z}^{(2)} = \underbrace{W^{(2)}}_{128 \times 256} \underbrace{\mathbf{a}^{(1)}}_{256 \times 1} + \underbrace{\mathbf{b}^{(2)}}_{128 \times 1} \Rightarrow \underbrace{\mathbf{z}^{(2)}}_{128 \times 1}$
Layer 3: $\mathbf{z}^{(3)} = \underbrace{W^{(3)}}_{10 \times 128} \underbrace{\mathbf{a}^{(2)}}_{128 \times 1} + \underbrace{\mathbf{b}^{(3)}}_{10 \times 1} \Rightarrow \underbrace{\mathbf{z}^{(3)}}_{10 \times 1}$
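The same shape chain can be checked quickly with random arrays; this is a minimal sketch using the layer sizes from the example above and ReLU as a stand-in activation:

```python
import numpy as np

# Architecture (784, 256, 128, 10) from the example above
sizes = [784, 256, 128, 10]
weights = [np.random.randn(sizes[l+1], sizes[l]) for l in range(len(sizes) - 1)]
biases = [np.random.randn(sizes[l+1]) for l in range(len(sizes) - 1)]

a = np.random.randn(784)  # a^(0) = x
for l, (W, b) in enumerate(zip(weights, biases), start=1):
    z = W @ a + b           # (n_l, n_{l-1}) @ (n_{l-1},) + (n_l,)
    a = np.maximum(0, z)    # ReLU activation
    print(f"Layer {l}: W {W.shape} -> z {z.shape}")
# Prints (256,), (128,), (10,) for the three layers
```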
```python
import numpy as np
from typing import List


def single_sample_forward(
    x: np.ndarray,               # Shape: (n_input,)
    weights: List[np.ndarray],   # Each W: (n_out, n_in)
    biases: List[np.ndarray],    # Each b: (n_out,)
    activations: List[callable],
) -> np.ndarray:
    """
    Forward propagation for a single sample using matrix-vector operations.

    This implementation follows the mathematical notation exactly,
    explicitly showing the matrix-vector multiplications.

    Args:
        x: Input vector of shape (n_input,)
        weights: List of weight matrices W^(l), each shape (n_out, n_in)
        biases: List of bias vectors b^(l), each shape (n_out,)
        activations: List of activation functions

    Returns:
        output: Network output vector
    """
    a = x  # a^(0) = x

    for l, (W, b, sigma) in enumerate(zip(weights, biases, activations)):
        a_prev = a

        # z^(l) = W^(l) @ a^(l-1) + b^(l)
        # Matrix-vector multiplication: (n_l, n_{l-1}) @ (n_{l-1},) -> (n_l,)
        z = W @ a_prev + b

        # a^(l) = σ^(l)(z^(l))
        a = sigma(z)

        # Debug output (optional)
        print(f"Layer {l+1}: W{W.shape} @ a{a_prev.shape} + b{b.shape} "
              f"-> z{z.shape} -> a{a.shape}")

    return a


def verify_component_equivalence():
    """
    Verify that matrix formulation equals component-wise computation.

    This demonstrates the equivalence of the two perspectives.
    """
    np.random.seed(42)

    # Simple network: 3 -> 4 -> 2
    W1 = np.random.randn(4, 3)
    b1 = np.random.randn(4)
    W2 = np.random.randn(2, 4)
    b2 = np.random.randn(2)
    x = np.random.randn(3)

    relu = lambda z: np.maximum(0, z)

    # Matrix form
    z1_matrix = W1 @ x + b1
    a1_matrix = relu(z1_matrix)
    z2_matrix = W2 @ a1_matrix + b2
    a2_matrix = relu(z2_matrix)

    # Component form
    z1_component = np.zeros(4)
    for j in range(4):
        z1_component[j] = sum(W1[j, i] * x[i] for i in range(3)) + b1[j]
    a1_component = relu(z1_component)

    z2_component = np.zeros(2)
    for j in range(2):
        z2_component[j] = sum(W2[j, i] * a1_component[i] for i in range(4)) + b2[j]
    a2_component = relu(z2_component)

    # Compare
    print("Matrix vs Component Equivalence")
    print("=" * 40)
    print(f"z1 max diff: {np.max(np.abs(z1_matrix - z1_component)):.2e}")
    print(f"a1 max diff: {np.max(np.abs(a1_matrix - a1_component)):.2e}")
    print(f"z2 max diff: {np.max(np.abs(z2_matrix - z2_component)):.2e}")
    print(f"a2 max diff: {np.max(np.abs(a2_matrix - a2_component)):.2e}")
    print("\n✓ Matrix and component forms are mathematically equivalent")


if __name__ == "__main__":
    verify_component_equivalence()
```

Batch computation extends the single-sample formulation to process $B$ samples simultaneously using matrix-matrix operations.
Batch Representation:
Stack $B$ input samples as rows:
$$X = \begin{bmatrix} (\mathbf{x}^{(1)})^\top \\ (\mathbf{x}^{(2)})^\top \\ \vdots \\ (\mathbf{x}^{(B)})^\top \end{bmatrix} \in \mathbb{R}^{B \times n_0}$$
Batch Forward Computation:
$$Z^{(l)} = A^{(l-1)} W^{(l)\top} + \mathbf{1}_B \mathbf{b}^{(l)\top}$$
$$A^{(l)} = \sigma^{(l)}(Z^{(l)})$$
With broadcasting (as implemented in NumPy/PyTorch):
$$Z^{(l)} = A^{(l-1)} W^{(l)\top} + \mathbf{b}^{(l)}$$
The bias vector $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$ broadcasts to all $B$ rows.
Dimensional Analysis:
Consider network $(784, 256, 10)$ with batch $B = 32$:
Layer 1: $$Z^{(1)} = \underbrace{A^{(0)}}_{32 \times 784} \underbrace{W^{(1)\top}}_{784 \times 256} + \underbrace{\mathbf{b}^{(1)}}_{256} \Rightarrow \underbrace{Z^{(1)}}_{32 \times 256}$$
Layer 2: $$Z^{(2)} = \underbrace{A^{(1)}}_{32 \times 256} \underbrace{W^{(2)\top}}_{256 \times 10} + \underbrace{\mathbf{b}^{(2)}}_{10} \Rightarrow \underbrace{Z^{(2)}}_{32 \times 10}$$
Alternative Convention (Samples as Columns):
Some texts use: $$Z^{(l)} = W^{(l)} A^{(l-1)} + \mathbf{b}^{(l)} \mathbf{1}_B^\top$$
where $A^{(l)} \in \mathbb{R}^{n_l \times B}$ has samples as columns. This avoids the transpose on $W$ but changes broadcasting semantics.
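To make the difference concrete, here is a small sketch (random data, arbitrary sizes) confirming that both conventions compute the same numbers, just laid out as rows versus columns:

```python
import numpy as np

np.random.seed(0)
B, n_in, n_out = 4, 3, 5
W = np.random.randn(n_out, n_in)   # W^(l): (n_l, n_{l-1})
b = np.random.randn(n_out)
X_rows = np.random.randn(B, n_in)  # samples as rows:    (B, n_0)
X_cols = X_rows.T                  # samples as columns: (n_0, B)

Z_rows = X_rows @ W.T + b          # (B, n_out); b broadcasts over rows
Z_cols = W @ X_cols + b[:, None]   # (n_out, B); b broadcasts over columns

# Same values, transposed layout
print(np.allclose(Z_rows, Z_cols.T))  # True
```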
```python
import numpy as np
from typing import List, Tuple


def batch_forward_explicit(
    X: np.ndarray,               # Shape: (B, n_input)
    weights: List[np.ndarray],   # Each W: (n_out, n_in)
    biases: List[np.ndarray],    # Each b: (n_out,)
    activations: List[callable],
) -> Tuple[np.ndarray, List[np.ndarray]]:
    """
    Batch forward propagation with explicit matrix operations.

    This implementation clearly shows the matrix-matrix multiplications
    and broadcasting for batch processing.

    Args:
        X: Input batch of shape (B, n_input)
        weights: List of weight matrices W^(l), each shape (n_out, n_in)
        biases: List of bias vectors b^(l), each shape (n_out,)
        activations: List of activation functions

    Returns:
        output: Network output of shape (B, n_output)
        all_activations: All layer activations for backprop
    """
    B = X.shape[0]
    A = X  # A^(0) = X, shape (B, n_0)
    all_activations = [A]

    for l, (W, b, sigma) in enumerate(zip(weights, biases, activations)):
        # Z^(l) = A^(l-1) @ W^(l).T + b^(l)
        # Shape: (B, n_{l-1}) @ (n_{l-1}, n_l) + (n_l,) -> (B, n_l)
        # The bias b broadcasts from (n_l,) to (B, n_l)
        Z = A @ W.T + b

        # A^(l) = σ^(l)(Z^(l))
        A = sigma(Z)
        all_activations.append(A)

        print(f"Layer {l+1}:")
        print(f"  A^({l}) @ W^({l+1}).T + b^({l+1})")
        print(f"  ({B}×{W.shape[1]}) @ ({W.shape[1]}×{W.shape[0]}) + ({b.shape[0]},)")
        print(f"  -> Z: {Z.shape}, A: {A.shape}")

    return A, all_activations


def demonstrate_broadcasting():
    """
    Explain and demonstrate NumPy/PyTorch broadcasting in batch computation.
    """
    print("Broadcasting in Batch Computation")
    print("=" * 50)

    B = 4  # Batch size
    n_in = 3
    n_out = 5

    # Batch of activations
    A = np.random.randn(B, n_in)
    print(f"Input A: shape {A.shape}")
    print(A)

    # Weight matrix
    W = np.random.randn(n_out, n_in)
    print(f"\nWeight W: shape {W.shape}")

    # Bias vector
    b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    print(f"Bias b: shape {b.shape}")
    print(f"  b = {b}")

    # Matrix multiplication
    Z_no_bias = A @ W.T
    print(f"\nA @ W.T: shape {Z_no_bias.shape}")

    # Broadcasting: b is added to each row
    Z = Z_no_bias + b
    print(f"(A @ W.T) + b: shape {Z.shape}")

    print("\nBroadcasting adds b to each row:")
    for i in range(B):
        print(f"  Row {i}: Z_no_bias[{i}] + b = {Z_no_bias[i]} + {b}")
        print(f"           = {Z[i]}")

    # Verify equivalence to explicit addition
    Z_explicit = np.zeros((B, n_out))
    for i in range(B):
        Z_explicit[i] = Z_no_bias[i] + b
    print(f"\nMax diff (broadcast vs explicit): {np.max(np.abs(Z - Z_explicit))}")


def flops_analysis():
    """
    Analyze FLOPs for batch matrix operations.
    """
    print("\nFLOPs Analysis for Batch Computation")
    print("=" * 50)

    # Network: 784 -> 512 -> 256 -> 10
    architecture = [784, 512, 256, 10]
    batch_size = 32

    total_flops = 0
    for l in range(len(architecture) - 1):
        n_in = architecture[l]
        n_out = architecture[l + 1]

        # Matrix multiply: (B, n_in) @ (n_in, n_out)
        # FLOPs: 2 * B * n_out * n_in (multiply-add)
        mm_flops = 2 * batch_size * n_out * n_in

        # Bias addition: B * n_out
        bias_flops = batch_size * n_out

        # Activation (ReLU): B * n_out comparisons
        act_flops = batch_size * n_out

        layer_flops = mm_flops + bias_flops + act_flops
        total_flops += layer_flops

        print(f"Layer {l+1} ({n_in}→{n_out}):")
        print(f"  MatMul: {mm_flops:,} FLOPs ({100*mm_flops/layer_flops:.1f}%)")
        print(f"  Bias:   {bias_flops:,} FLOPs")
        print(f"  Act:    {act_flops:,} FLOPs")
        print(f"  Total:  {layer_flops:,} FLOPs")

    print(f"\nTotal network: {total_flops:,} FLOPs")
    print(f"Per sample: {total_flops // batch_size:,} FLOPs")


if __name__ == "__main__":
    demonstrate_broadcasting()
    flops_analysis()
```

Using Z = A @ W.T (with samples as rows) means we store W with output neurons as rows, the same orientation used in the single-sample formula z = Wx. This consistency simplifies code. PyTorch's nn.Linear uses this convention: its weight has shape (out_features, in_features), matching our $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$.
Understanding the Jacobian (matrix of partial derivatives) of network operations is essential for analyzing gradient flow and understanding backpropagation.
Jacobian Definition:
For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J_f \in \mathbb{R}^{m \times n}$ has entries:
$$(J_f)_{ij} = \frac{\partial f_i}{\partial x_j}$$
Jacobian of Affine Transformation:
For $g(\mathbf{a}) = W\mathbf{a} + \mathbf{b}$:
$$J_g = \frac{\partial (W\mathbf{a} + \mathbf{b})}{\partial \mathbf{a}} = W$$
The Jacobian of a linear transformation is simply the weight matrix itself.
Jacobian of Element-wise Activation:
For $\sigma(\mathbf{z})$ applied element-wise:
$$J_\sigma = \text{diag}(\sigma'(z_1), \sigma'(z_2), \ldots, \sigma'(z_n))$$
A diagonal matrix with activation derivatives on the diagonal.
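Both Jacobians can be verified numerically with finite differences; the sketch below uses arbitrary small shapes and tanh as an example element-wise activation:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += eps
        x_minus[j] -= eps
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return J

np.random.seed(0)
W = np.random.randn(3, 4)
b = np.random.randn(3)
a = np.random.randn(4)
z = np.random.randn(3)

# Affine map: Jacobian is W itself
J_affine = numerical_jacobian(lambda v: W @ v + b, a)
print(np.allclose(J_affine, W, atol=1e-5))                        # True

# Element-wise tanh: Jacobian is diag(1 - tanh(z)^2)
J_act = numerical_jacobian(np.tanh, z)
print(np.allclose(J_act, np.diag(1 - np.tanh(z)**2), atol=1e-5))  # True
```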
Chain Rule for Composed Functions:
For composition $h = f \circ g$:
$$J_h = J_f \cdot J_g$$
Jacobians multiply in the order of function application.
Full Forward Pass Jacobian:
For the network function $f(\mathbf{x}) = \mathbf{a}^{(L)}$, the Jacobian $\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}}$ is:
$$\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}} = \left(\prod_{l=L}^{1} D^{(l)} W^{(l)}\right)$$
where $D^{(l)} = \text{diag}(\sigma'^{(l)}(\mathbf{z}^{(l)}))$ is the diagonal matrix of activation derivatives at layer $l$.
Expanded:
$$\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}} = D^{(L)} W^{(L)} D^{(L-1)} W^{(L-1)} \cdots D^{(1)} W^{(1)}$$
Gradient of Loss w.r.t. Input:
If $\mathcal{L} = \ell(\mathbf{a}^{(L)})$ for some scalar loss $\ell$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \left(\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{x}}\right)^\top \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}}$$
This is exactly what backpropagation computes, layer by layer, without forming the full Jacobian.
```python
import numpy as np
from typing import List, Tuple


def relu(z):
    return np.maximum(0, z)


def relu_derivative(z):
    """Derivative of ReLU: 1 if z > 0, else 0."""
    return (z > 0).astype(float)


def compute_layer_jacobian(
    W: np.ndarray,
    z: np.ndarray,
    activation_derivative: callable,
) -> np.ndarray:
    """
    Compute the Jacobian of a single layer: J = D @ W

    Args:
        W: Weight matrix, shape (n_out, n_in)
        z: Pre-activation vector, shape (n_out,)
        activation_derivative: σ'(z) function

    Returns:
        Jacobian of shape (n_out, n_in)
    """
    # D is diagonal matrix of activation derivatives
    deriv = activation_derivative(z)  # Shape (n_out,)
    D = np.diag(deriv)                # Shape (n_out, n_out)

    # J = D @ W
    J = D @ W
    return J


def compute_full_jacobian(
    x: np.ndarray,
    weights: List[np.ndarray],
    biases: List[np.ndarray],
    activation_derivative: callable,
) -> Tuple[np.ndarray, List[np.ndarray]]:
    """
    Compute full network Jacobian ∂a^(L)/∂x.

    This explicitly constructs the Jacobian for demonstration.
    In practice, we use backpropagation instead.

    Returns:
        full_jacobian: ∂a^(L)/∂x of shape (n_L, n_0)
        layer_jacobians: List of per-layer Jacobians
    """
    # Forward pass to get z values
    a = x
    z_values = []
    for W, b in zip(weights, biases):
        z = W @ a + b
        z_values.append(z)
        a = relu(z)

    # Compute layer Jacobians
    layer_jacobians = []
    for l in range(len(weights)):
        J_l = compute_layer_jacobian(weights[l], z_values[l], activation_derivative)
        layer_jacobians.append(J_l)

    # Chain together: J = J_L @ J_{L-1} @ ... @ J_1
    full_jacobian = layer_jacobians[-1]
    for l in range(len(weights) - 2, -1, -1):
        full_jacobian = full_jacobian @ layer_jacobians[l]

    return full_jacobian, layer_jacobians


def verify_jacobian_numerically(
    x: np.ndarray,
    weights: List[np.ndarray],
    biases: List[np.ndarray],
    eps: float = 1e-5,
) -> np.ndarray:
    """
    Compute Jacobian numerically via finite differences.

    Used to verify analytical Jacobian computation.
    """
    def forward(x):
        a = x
        for W, b in zip(weights, biases):
            z = W @ a + b
            a = relu(z)
        return a

    n_in = x.shape[0]
    n_out = forward(x).shape[0]
    J_numerical = np.zeros((n_out, n_in))

    for i in range(n_in):
        x_plus = x.copy()
        x_plus[i] += eps
        x_minus = x.copy()
        x_minus[i] -= eps
        J_numerical[:, i] = (forward(x_plus) - forward(x_minus)) / (2 * eps)

    return J_numerical


def analyze_jacobian_properties():
    """
    Analyze Jacobian to understand gradient flow.
    """
    np.random.seed(42)

    # Simple network: 4 -> 8 -> 6 -> 3
    weights = [
        np.random.randn(8, 4) * 0.5,
        np.random.randn(6, 8) * 0.5,
        np.random.randn(3, 6) * 0.5,
    ]
    biases = [np.zeros(8), np.zeros(6), np.zeros(3)]
    x = np.random.randn(4)

    # Compute analytical Jacobian
    J_analytical, layer_Js = compute_full_jacobian(
        x, weights, biases, relu_derivative
    )

    # Compute numerical Jacobian
    J_numerical = verify_jacobian_numerically(x, weights, biases)

    print("Jacobian Analysis")
    print("=" * 50)
    print(f"Network: 4 → 8 → 6 → 3")
    print(f"Full Jacobian shape: {J_analytical.shape}")

    print("\nLayer Jacobians:")
    for l, J in enumerate(layer_Js):
        print(f"  Layer {l+1}: shape {J.shape}, "
              f"rank={np.linalg.matrix_rank(J)}, "
              f"norm={np.linalg.norm(J):.4f}")

    # Verify correctness
    max_diff = np.max(np.abs(J_analytical - J_numerical))
    print(f"\nMax diff (analytical vs numerical): {max_diff:.2e}")

    # Analyze singular values for gradient flow
    U, S, Vh = np.linalg.svd(J_analytical)
    print(f"\nFull Jacobian singular values: {S}")
    print(f"Condition number: {S.max() / S.min():.2f}")
    print("(Large condition number → gradient magnification/shrinkage)")


if __name__ == "__main__":
    analyze_jacobian_properties()
```

Explicitly constructing the full Jacobian requires O(n_L × n_0) space and is expensive to compute; for realistic networks, this is infeasible. Backpropagation instead computes the product Jᵀv (a vector-Jacobian product) efficiently in O(parameters) time, without ever forming J explicitly. This is why we study Jacobian structure for analysis but use backpropagation for computation.
Efficient neural network implementation leverages decades of linear algebra optimization. Understanding these optimizations explains performance characteristics.
BLAS Levels:
BLAS (Basic Linear Algebra Subprograms) defines three levels of operations:
| Level | Operation | FLOPs | Memory | Arithmetic Intensity |
|---|---|---|---|---|
| 1 | Vector-vector (dot product) | O(n) | O(n) | O(1) |
| 2 | Matrix-vector | O(n²) | O(n²) | O(1) |
| 3 | Matrix-matrix | O(n³) | O(n²) | O(n) |
Arithmetic intensity = FLOPs / Memory accesses
Higher intensity = better hardware utilization.
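A back-of-the-envelope comparison for n = 1024 with float32 operands (counting a multiply-add as 2 FLOPs) makes the last column of the table concrete:

```python
n = 1024
bytes_per_float = 4

# Level-2: y = W @ x, with W (n, n) and x (n,)
flops_l2 = 2 * n * n
mem_l2 = (n * n + 2 * n) * bytes_per_float   # read W and x, write y
print(f"Level-2 intensity: {flops_l2 / mem_l2:.2f} FLOPs/byte")  # ~0.5

# Level-3: C = A @ B, all matrices (n, n)
flops_l3 = 2 * n ** 3
mem_l3 = 3 * n * n * bytes_per_float         # read A and B, write C
print(f"Level-3 intensity: {flops_l3 / mem_l3:.1f} FLOPs/byte")  # ~170
```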
Why Batch Processing Wins:
Single-sample: Level-2 BLAS (matrix-vector). Batch processing: Level-3 BLAS (matrix-matrix).
Level-3 operations can reuse data in fast cache, reducing memory bandwidth bottleneck.
The Memory Hierarchy:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| Registers | KB | 0 cycles | N/A |
| L1 Cache | 32-64 KB | 4 cycles | ~1 TB/s |
| L2 Cache | 256 KB | ~12 cycles | ~500 GB/s |
| L3 Cache | 8-32 MB | ~40 cycles | ~200 GB/s |
| Main Memory | GB | ~200 cycles | ~50 GB/s |
| GPU Global Memory | GB | ~400 cycles | ~900 GB/s |
Matrix-matrix multiplication algorithms (like Strassen, or tiled implementations) exploit cache hierarchy to minimize slow memory access.
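The idea behind tiled (blocked) multiplication can be sketched in a few lines: compute the product block by block so each tile stays in cache while it is reused. This is only an illustration; production BLAS kernels choose tile sizes to match cache levels and use vectorized microkernels.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    """Blocked matrix multiply: C = A @ B computed tile by tile."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Each small block fits in cache and is reused across inner products
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.randn(256, 256)
B = np.random.randn(256, 256)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```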
```python
import numpy as np
import time


def benchmark_matrix_operations():
    """
    Compare performance of different operation patterns.

    Demonstrates why batching and matrix ops are preferred.
    """
    np.random.seed(42)

    n = 512
    k = 256
    B = 128  # Batch size

    # Matrices
    A = np.random.randn(B, n).astype(np.float32)  # Batch of vectors
    W = np.random.randn(k, n).astype(np.float32)  # Weight matrix
    b = np.random.randn(k).astype(np.float32)     # Bias

    n_trials = 100

    print("Matrix Operation Benchmarks")
    print("=" * 50)
    print(f"Batch size: {B}, Input dim: {n}, Output dim: {k}")
    print("-" * 50)

    # Method 1: Loop over samples (Level-2 BLAS)
    start = time.time()
    for _ in range(n_trials):
        results = np.zeros((B, k), dtype=np.float32)
        for i in range(B):
            results[i] = W @ A[i] + b
    loop_time = time.time() - start
    print(f"Loop (Level-2):  {1000*loop_time/n_trials:.3f} ms/batch")

    # Method 2: Batch matrix multiply (Level-3 BLAS)
    start = time.time()
    for _ in range(n_trials):
        results = A @ W.T + b
    batch_time = time.time() - start
    print(f"Batch (Level-3): {1000*batch_time/n_trials:.3f} ms/batch")

    speedup = loop_time / batch_time
    print(f"\nSpeedup: {speedup:.1f}x")

    # Method 3: Show FLOP analysis
    flops_per_batch = 2 * B * k * n + B * k  # matmul + bias
    gflops_batch = (flops_per_batch * n_trials) / (batch_time * 1e9)
    print(f"\nBatch throughput: {gflops_batch:.2f} GFLOPS")


def memory_access_analysis():
    """
    Analyze memory access patterns in matrix operations.
    """
    print("\nMemory Access Analysis")
    print("=" * 50)

    # For Z = A @ W.T + b
    # Where A: (B, n), W: (k, n), b: (k,)
    B, n, k = 128, 512, 256

    print(f"Operation: Z = A @ W.T + b")
    print(f"A: ({B}, {n}), W: ({k}, {n}), b: ({k},)")
    print()

    # Memory reads
    A_bytes = B * n * 4  # float32
    W_bytes = k * n * 4
    b_bytes = k * 4
    total_read = A_bytes + W_bytes + b_bytes

    # Memory writes
    Z_bytes = B * k * 4

    # FLOPs
    flops = 2 * B * k * n + B * k  # matmul + bias

    # Arithmetic intensity
    intensity = flops / (total_read + Z_bytes)

    print(f"Memory reads:  {total_read / 1024:.1f} KB")
    print(f"  A: {A_bytes / 1024:.1f} KB")
    print(f"  W: {W_bytes / 1024:.1f} KB")
    print(f"  b: {b_bytes / 1024:.2f} KB")
    print(f"Memory writes: {Z_bytes / 1024:.1f} KB")
    print(f"Total memory:  {(total_read + Z_bytes) / 1024:.1f} KB")
    print(f"\nFLOPs: {flops:,}")
    print(f"Arithmetic Intensity: {intensity:.1f} FLOPs/byte")

    # Compare with single sample (B=1)
    B_single = 1
    A_bytes_s = B_single * n * 4
    Z_bytes_s = B_single * k * 4
    flops_s = 2 * B_single * k * n + B_single * k
    total_mem_s = A_bytes_s + W_bytes + b_bytes + Z_bytes_s
    intensity_s = flops_s / total_mem_s

    print(f"\nSingle sample intensity: {intensity_s:.2f} FLOPs/byte")
    print(f"Batch intensity improvement: {intensity / intensity_s:.1f}x")


if __name__ == "__main__":
    benchmark_matrix_operations()
    memory_access_analysis()
```

The roofline model visualizes the relationship between arithmetic intensity and achievable performance. Operations with low intensity (like single-sample inference) are memory-bound: performance is limited by memory bandwidth. High-intensity operations (large batch matrix multiplies) are compute-bound: limited by processor speed. Understanding this helps choose batch sizes and identify bottlenecks.
Translating matrix formulations into efficient code requires attention to several practical concerns.
Consistent Shape Handling:
- Keep a batch dimension throughout; use reshape or unsqueeze to add a batch dimension when passing a single sample.

In-Place Operations:
- torch.relu_(x) applies ReLU in place, avoiding an extra tensor allocation; use in-place ops only where the original tensor is not needed elsewhere.

Fused Operations:
- F.linear(x, W, b): matmul + bias in one operation
- F.relu(x, inplace=True): saves memory

Data Types:
- float32 is the usual default; lower-precision types reduce memory and bandwidth (a short autocast sketch follows this list).
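As a sketch of the data-type point, PyTorch's autocast context runs eligible operations (such as linear layers) in a lower-precision type inside the block; bfloat16 on CPU is shown here only because it runs without a GPU, and actual speedups depend on hardware support:

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 256)
x = torch.randn(32, 784)

# Default: float32 end to end
print(layer(x).dtype)  # torch.float32

# Mixed precision: matmuls run in bfloat16 inside the autocast region
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = layer(x)
print(y.dtype)  # torch.bfloat16
```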
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EfficientMLP(nn.Module):
    """
    MLP implementation following best practices for efficiency.

    Key practices:
    - Fused operations where possible
    - Proper initialization
    - Optional mixed precision support
    - Efficient forward pass implementation
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: list,
        out_features: int,
        activation: str = "relu",
        use_bias: bool = True,
    ):
        super().__init__()

        all_sizes = [in_features] + hidden_features + [out_features]

        # Build layers
        self.layers = nn.ModuleList()
        for i in range(len(all_sizes) - 1):
            self.layers.append(
                nn.Linear(all_sizes[i], all_sizes[i+1], bias=use_bias)
            )

        # Store activation choice
        self.activation = activation
        self.num_hidden = len(hidden_features)

        # Initialize weights
        self._init_weights()

    def _init_weights(self):
        """Apply He initialization for ReLU networks."""
        for module in self.layers:
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

    def _get_activation(self, x: torch.Tensor, inplace: bool = True) -> torch.Tensor:
        """Apply activation function efficiently."""
        if self.activation == "relu":
            return F.relu(x, inplace=inplace)
        elif self.activation == "gelu":
            return F.gelu(x)
        elif self.activation == "silu":
            return F.silu(x, inplace=inplace)
        else:
            raise ValueError(f"Unknown activation: {self.activation}")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Efficient forward pass.

        Uses in-place operations for activations where safe.
        """
        # Ensure 2D input
        if x.dim() == 1:
            x = x.unsqueeze(0)

        # Hidden layers with activation
        for layer in self.layers[:-1]:
            # F.linear is a fused matmul + bias
            x = self._get_activation(layer(x), inplace=True)

        # Output layer (no activation)
        x = self.layers[-1](x)
        return x


class NumericallyStableOutput(nn.Module):
    """
    Numerically stable output layer for classification.

    Combines log_softmax with the loss for stable training.
    For inference, can use softmax normally.
    """

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor, return_logits: bool = True) -> torch.Tensor:
        """
        Forward pass returning logits or log-probabilities.

        Args:
            x: Input features
            return_logits: If True, return raw logits (for use with CrossEntropyLoss)
                           If False, return log_softmax (for NLLLoss or analysis)
        """
        logits = self.linear(x)

        if return_logits:
            # CrossEntropyLoss combines log_softmax + nll internally
            return logits
        else:
            # Numerically stable log-softmax
            return F.log_softmax(logits, dim=-1)

    def predict_proba(self, x: torch.Tensor) -> torch.Tensor:
        """Return probabilities for inference."""
        with torch.no_grad():
            logits = self.linear(x)
            return F.softmax(logits, dim=-1)


# Demonstration
def demonstrate_efficient_implementation():
    """Show efficient MLP in action."""
    # Create model
    model = EfficientMLP(
        in_features=784,
        hidden_features=[512, 256, 128],
        out_features=10,
        activation="relu"
    )

    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Model parameters: {total_params:,}")

    # Test forward pass
    batch = torch.randn(64, 784)

    # Warm up (compilation, caching)
    _ = model(batch)

    # Benchmark
    import time
    n_iterations = 100
    start = time.time()
    for _ in range(n_iterations):
        output = model(batch)
    elapsed = time.time() - start

    print(f"Forward pass: {1000 * elapsed / n_iterations:.3f} ms/batch")
    print(f"Throughput: {64 * n_iterations / elapsed:.0f} samples/sec")


if __name__ == "__main__":
    demonstrate_efficient_implementation()
```

Matrix formulation transforms neural network computation from abstract algebra into implementable, efficient operations. This foundation enables both theoretical analysis and practical implementation.
What's Next:
With the computational framework established, we turn to the final core component: Activation Functions. These nonlinear functions are what make neural networks more than stacked linear transformations. We'll examine their mathematical properties, gradient behavior, and selection guidelines for different architectures.
You now command the matrix formulation of neural network computation. This understanding directly maps to code: when you write output = X @ W.T + b, you know exactly what mathematical operation is performed, how it relates to the network's function class, and why this representation enables efficient computation. This foundation makes backpropagation, which we'll cover soon, much more intuitive.