Forward propagation is the algorithm that transforms an input vector into an output prediction by sequentially applying the transformations encoded in each layer. It is conceptually simple yet foundational—every prediction a neural network makes, whether classifying an image or generating text, relies on forward propagation.
The term "forward" distinguishes this computation from "backward" propagation (backpropagation), which computes gradients during training. Forward propagation is used both during training (to compute predictions and loss) and during inference (to make predictions on new data).
Understanding forward propagation deeply means understanding the precise algorithm, the flow of information through the layers, the computational and memory cost, and the numerical pitfalls of floating-point arithmetic.
This page provides that rigorous understanding, building from first principles to efficient implementation.
By the end of this page, you will understand: (1) The complete forward propagation algorithm for MLPs; (2) How to trace information flow through network layers; (3) Computational and memory complexity analysis; (4) Numerical stability considerations; (5) Implementation for both single samples and batches.
Forward propagation proceeds layer by layer, computing activations from input to output. We formalize this precisely.
Algorithm: Forward Propagation
Input: input vector x; weight matrices W⁽ˡ⁾ and bias vectors b⁽ˡ⁾ for l = 1, …, L; activation functions σ⁽ˡ⁾ for l = 1, …, L
Output: prediction y = a⁽ᴸ⁾
Procedure:
1. Initialize: a⁽⁰⁾ ← x
2. For l = 1 to L:
a. Compute pre-activation: z⁽ˡ⁾ ← W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾
b. Compute activation: a⁽ˡ⁾ ← σ⁽ˡ⁾(z⁽ˡ⁾)
3. Return: y ← a⁽ᴸ⁾
Mathematical Expansion:
For a 3-layer network (2 hidden + 1 output):
$$\mathbf{z}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)} \quad \mathbf{a}^{(1)} = \sigma^{(1)}(\mathbf{z}^{(1)})$$
$$\mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)} \quad \mathbf{a}^{(2)} = \sigma^{(2)}(\mathbf{z}^{(2)})$$
$$\mathbf{z}^{(3)} = W^{(3)}\mathbf{a}^{(2)} + \mathbf{b}^{(3)} \quad \mathbf{y} = \sigma^{(3)}(\mathbf{z}^{(3)})$$
The network function as a composition:
$$f(\mathbf{x}) = \sigma^{(3)}(W^{(3)} \cdot \sigma^{(2)}(W^{(2)} \cdot \sigma^{(1)}(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}) + \mathbf{b}^{(3)})$$
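As a sanity check, the nested composition and the layer-by-layer loop can be verified to agree numerically. A minimal sketch (random weights and tanh at every layer, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.tanh  # same activation at every layer, for simplicity

# Random 3-layer network: 4 → 5 → 5 → 2
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))]
bs = [rng.normal(size=5), rng.normal(size=5), rng.normal(size=2)]
x = rng.normal(size=4)

# Layer-by-layer loop: a^(l) = sigma(W^(l) a^(l-1) + b^(l))
a = x
for W, b in zip(Ws, bs):
    a = sigma(W @ a + b)

# Single nested composition, exactly as in the formula above
nested = sigma(Ws[2] @ sigma(Ws[1] @ sigma(Ws[0] @ x + bs[0]) + bs[1]) + bs[2])

assert np.allclose(a, nested)
```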
```python
import numpy as np
from typing import List, Tuple, Callable, Dict
from dataclasses import dataclass


@dataclass
class ForwardCache:
    """
    Cache storing all intermediate computations during forward pass.
    Required for backpropagation during training.
    """
    inputs: np.ndarray                 # Original input x
    pre_activations: List[np.ndarray]  # z^(l) for each layer
    activations: List[np.ndarray]      # a^(l) for each layer, a^(0) = x


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Numerically stable sigmoid activation."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )


def relu(z: np.ndarray) -> np.ndarray:
    """ReLU activation: max(0, z)."""
    return np.maximum(0, z)


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax activation."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


def forward_propagation(
    x: np.ndarray,
    weights: List[np.ndarray],
    biases: List[np.ndarray],
    activations: List[Callable],
    return_cache: bool = True
) -> Tuple[np.ndarray, ForwardCache]:
    """
    Complete forward propagation through an MLP.

    This implementation is designed for clarity and correctness,
    demonstrating the exact computation at each step.

    Args:
        x: Input array of shape (n_features,) or (batch_size, n_features)
        weights: List of weight matrices W^(l), each shape (n_out, n_in)
        biases: List of bias vectors b^(l), each shape (n_out,)
        activations: List of activation functions σ^(l)
        return_cache: Whether to return intermediate values for backprop

    Returns:
        output: Network output of shape (n_output,) or (batch_size, n_output)
        cache: ForwardCache with all intermediate values
    """
    # Ensure input is 2D: (batch_size, n_features)
    if x.ndim == 1:
        x = x.reshape(1, -1)

    L = len(weights)  # Number of layers

    # Initialize cache
    pre_activations = []
    layer_activations = [x]  # a^(0) = x

    a = x  # Current activation
    for l in range(L):
        # Pre-activation: z^(l) = a^(l-1) @ W^(l).T + b^(l)
        # Note: We use (batch, features) convention, so we transpose W
        z = a @ weights[l].T + biases[l]
        pre_activations.append(z)

        # Activation: a^(l) = σ^(l)(z^(l))
        a = activations[l](z)
        layer_activations.append(a)

    cache = ForwardCache(
        inputs=x,
        pre_activations=pre_activations,
        activations=layer_activations
    ) if return_cache else None

    return a, cache


def trace_forward_pass(
    x: np.ndarray,
    weights: List[np.ndarray],
    biases: List[np.ndarray],
    activations: List[Callable],
    activation_names: List[str]
) -> Dict:
    """
    Trace forward pass with detailed output at each step.
    Useful for debugging and understanding information flow.
    """
    print("=" * 60)
    print("FORWARD PROPAGATION TRACE")
    print("=" * 60)

    if x.ndim == 1:
        x = x.reshape(1, -1)

    print(f"Input x: shape {x.shape}")
    print(f"  values: {x[0, :5]}..." if x.shape[1] > 5 else f"  values: {x[0]}")

    a = x
    trace = {"input": x, "layers": []}

    for l in range(len(weights)):
        print(f"--- Layer {l + 1} ({activation_names[l]}) ---")
        print(f"W^({l+1}) shape: {weights[l].shape}")
        print(f"b^({l+1}) shape: {biases[l].shape}")

        # Compute pre-activation
        z = a @ weights[l].T + biases[l]
        print(f"z^({l+1}) = a^({l}) @ W^({l+1}).T + b^({l+1})")
        print(f"  z shape: {z.shape}")
        print(f"  z range: [{z.min():.4f}, {z.max():.4f}]")
        print(f"  z mean: {z.mean():.4f}, std: {z.std():.4f}")

        # Compute activation
        a = activations[l](z)
        print(f"a^({l+1}) = {activation_names[l]}(z^({l+1}))")
        print(f"  a shape: {a.shape}")
        print(f"  a range: [{a.min():.4f}, {a.max():.4f}]")

        if activation_names[l] == "ReLU":
            sparsity = np.mean(a == 0)
            print(f"  sparsity (fraction of zeros): {sparsity:.2%}")

        trace["layers"].append({
            "z": z.copy(),
            "a": a.copy(),
            "W_shape": weights[l].shape,
            "activation": activation_names[l]
        })

    print("=" * 60)
    print(f"OUTPUT: shape {a.shape}")
    print(f"  values: {a[0]}")
    print("=" * 60)

    trace["output"] = a
    return trace


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Create a simple network: 4 → 8 → 6 → 3
    architecture = [4, 8, 6, 3]
    weights = []
    biases = []
    for i in range(len(architecture) - 1):
        n_in = architecture[i]
        n_out = architecture[i + 1]
        # He initialization
        W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
        b = np.zeros(n_out)
        weights.append(W)
        biases.append(b)

    # Activation functions: ReLU for hidden, softmax for output
    activations = [relu, relu, softmax]
    activation_names = ["ReLU", "ReLU", "Softmax"]

    # Single sample
    x = np.random.randn(4)

    # Trace the forward pass
    trace = trace_forward_pass(x, weights, biases, activations, activation_names)

    # Verify output sums to 1 (softmax property)
    print(f"Output sum (should be 1.0): {trace['output'].sum():.6f}")
```

During training, backpropagation requires the pre-activations z^(l) and activations a^(l) from the forward pass. Storing these in a "cache" during forward propagation avoids redundant computation. During inference (prediction only), we can discard intermediate values to save memory.
Understanding how information transforms at each layer provides intuition for network behavior and debugging.
Layer-wise Transformation:
Each layer performs a two-step transformation:
Affine Transformation ($\mathbf{z} = W\mathbf{a} + \mathbf{b}$): a linear map followed by a shift. Geometrically, it rotates, scales, and shears the representation, then translates it.

Nonlinear Activation ($\mathbf{a} = \sigma(\mathbf{z})$): an elementwise nonlinearity. Without it, any stack of affine layers would collapse into a single affine map, and the network could only represent linear functions.
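Why the nonlinearity matters can be shown in a few lines: two stacked affine layers with no activation between them collapse into a single affine layer. A small sketch with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two affine layers with NO nonlinearity between them
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

x = rng.normal(size=4)

# Layer-by-layer computation
two_layers = W2 @ (W1 @ x + b1) + b2

# Equivalent single affine layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layers, one_layer)
```

The activation function breaks exactly this collapse, which is why depth only adds expressive power when nonlinearities are present.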
Information Bottlenecks:
Narrow layers create information bottlenecks: a layer with fewer units than its input cannot preserve all of the input's information, so the network is forced to compress, keeping the features most useful for the task and discarding the rest.
Signal Propagation:
The statistics of activations change through layers: depending on the weight scale and the activation function, the mean and variance of activations can grow, shrink, or drift as depth increases. Careful initialization (e.g., He or Xavier) is what keeps these statistics stable through the network.
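A small experiment (the layer width, depth, and weight scales are illustrative, not canonical) shows how the activation statistics evolve with depth under different initializations:

```python
import numpy as np

def activation_stds(scale: float, n: int = 256, depth: int = 20, seed: int = 0):
    """Propagate a random batch through `depth` ReLU layers; record activation std."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(64, n))  # batch of 64 random inputs
    stds = [a.std()]
    for _ in range(depth):
        W = rng.normal(size=(n, n)) * scale
        a = np.maximum(0, a @ W.T)  # ReLU layer (biases omitted for simplicity)
        stds.append(a.std())
    return stds

small = activation_stds(scale=0.01)              # weights far too small
he = activation_stds(scale=np.sqrt(2.0 / 256))   # He initialization

print(f"small init: std at layer 0 = {small[0]:.3f}, layer 20 = {small[-1]:.3e}")
print(f"He init:    std at layer 0 = {he[0]:.3f}, layer 20 = {he[-1]:.3f}")
```

With too-small weights the signal vanishes exponentially with depth, while He initialization keeps the activation scale roughly constant, which is exactly the "healthy statistics" property discussed below.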
The Representational Perspective:
Think of each layer as learning a new coordinate system for the data: early layers encode simple, low-level features, deeper layers combine them into increasingly abstract ones, and in the new coordinates the classes become progressively easier to separate linearly.
Mathematically, if $\mathbf{a}^{(l)}$ is the representation at layer $l$, the linear classifier based on this representation is:
$$f(\mathbf{x}) = W^{(L)}\mathbf{a}^{(L-1)}(\mathbf{x}) + \mathbf{b}^{(L)}$$
The hidden layers learn to transform inputs such that this final linear classifier works well. This is sometimes called "learning useful representations."
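A tiny hand-constructed example illustrates the idea (the weights here are chosen by hand for illustration, not learned): XOR is not linearly separable in input space, but after one ReLU hidden layer a purely linear readout solves it.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# XOR inputs and targets: not linearly separable in input space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
H = relu(X @ W1.T + b1)   # hidden representation a^(1)

# Linear readout on the new representation: y = h1 - 2*h2
w2 = np.array([1.0, -2.0])
pred = H @ w2

print(pred)  # [0. 1. 1. 0.]
```

In the hidden coordinates $(h_1, h_2)$ the four points become linearly separable, which is what "learning useful representations" means in miniature.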
Receptive Field Concept:
For any hidden unit, its receptive field is the set of input features that can influence its activation. In fully connected layers, every unit's receptive field is the entire input. In CNNs, receptive fields are local and grow with depth.
When training fails, examine activation statistics at each layer. Warning signs include: (1) All activations near zero or one (saturation); (2) Extremely high variance (exploding); (3) Decreasing variance through layers (vanishing); (4) Very high sparsity with ReLU (>90% zeros indicates dying ReLU problem). Healthy networks maintain moderate, consistent statistics through layers.
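These warning signs can be folded into a simple diagnostic. The sketch below uses illustrative thresholds — the exact cutoffs are a judgment call, not canonical values:

```python
import numpy as np

def activation_warnings(a: np.ndarray, name: str = "layer") -> list:
    """Flag common pathologies in a layer's activation matrix."""
    warnings = []
    # Saturation check (meaningful for sigmoid-like outputs in [0, 1])
    saturated = np.mean((a < 0.01) | (a > 0.99))
    if saturated > 0.95:
        warnings.append(f"{name}: >95% of units saturated")
    if a.std() > 100:
        warnings.append(f"{name}: exploding activations (std={a.std():.1f})")
    if a.std() < 1e-4:
        warnings.append(f"{name}: vanishing activations (std={a.std():.2e})")
    if np.mean(a == 0) > 0.90:
        warnings.append(f"{name}: >90% zeros (possible dying ReLU)")
    return warnings

# A healthy ReLU layer and a completely dead one
rng = np.random.default_rng(0)
healthy = np.maximum(0, rng.normal(size=(32, 128)))
dead = np.zeros((32, 128))

print(activation_warnings(healthy, "healthy"))  # []
print(activation_warnings(dead, "dead"))
```

Running such a check on every layer's cached activations after a few training steps often localizes a failure to a specific depth before any loss curve makes it obvious.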
Analyzing the computational cost of forward propagation is essential for understanding scalability and optimizing performance.
Per-Layer Operations:
For layer $l$ with $n_{l-1}$ inputs and $n_l$ outputs:
Matrix-Vector Multiplication $(W^{(l)}\mathbf{a}^{(l-1)})$: $n_l n_{l-1}$ multiplications and $n_l(n_{l-1}-1)$ additions, so approximately $2 n_l n_{l-1}$ FLOPs.

Bias Addition $(+\ \mathbf{b}^{(l)})$: $n_l$ additions.

Activation $(\sigma(\mathbf{z}^{(l)}))$: $O(n_l)$ elementwise operations (the cost depends on the function; ReLU is a single comparison, while exponentials are more expensive).
Total Forward Pass:
$$\text{FLOPs} = \sum_{l=1}^{L} \left( 2n_l n_{l-1} + n_l + O(n_l) \right) \approx 2\sum_{l=1}^{L} n_l n_{l-1}$$
The matrix multiplications dominate for practical layer sizes.
| Architecture | Total FLOPs | Parameters | Memory (Activations) |
|---|---|---|---|
| 784→256→10 | ~400K | ~201K | ~1K floats |
| 784→512→256→10 | ~535K | ~534K | ~1.6K floats |
| 3072→1024→512→256→10 | ~3.9M | ~3.9M | ~5K floats |
| 768→3072→768 (Transformer FFN) | ~4.7M | ~4.7M | ~5K floats |
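The table's entries can be approximately reproduced with a short counting function based on the formula above (the table rounds; this computes exact values):

```python
def mlp_costs(layer_sizes):
    """Approximate FLOPs, parameter count, and activation floats for an MLP."""
    flops = sum(2 * n_out * n_in
                for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    params = sum(n_out * n_in + n_out  # weights plus biases
                 for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    activation_floats = sum(layer_sizes)  # a^(0) through a^(L)
    return flops, params, activation_floats

print(mlp_costs([784, 256, 10]))         # (406528, 203530, 1050)
print(mlp_costs([784, 512, 256, 10]))
```

Note that FLOPs ≈ 2 × parameter count for fully connected layers, since every weight participates in exactly one multiply and one add per forward pass.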
Batch Processing:
For a batch of $B$ samples: FLOPs scale linearly to $\approx 2B\sum_l n_l n_{l-1}$, activation memory scales by a factor of $B$, and the matrix-vector products become matrix-matrix products, which modern hardware executes far more efficiently per sample.
Memory Analysis:
During inference (no gradients): only the current layer's activation needs to be kept, so extra memory beyond the parameters is $O(B \max_l n_l)$.

During training (need cache): all pre-activations $z^{(l)}$ and activations $a^{(l)}$ are stored for backprop, requiring $O(B \sum_l n_l)$ additional floats.
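A rough estimator makes the inference/training gap concrete (float32 assumed; this ignores parameters and framework overhead):

```python
def activation_memory_bytes(layer_sizes, batch_size, training=True,
                            bytes_per_float=4):
    """Rough activation-memory estimate for one forward pass (float32)."""
    if training:
        # Cache the input plus z^(l) and a^(l) for every layer
        floats = batch_size * (layer_sizes[0] + 2 * sum(layer_sizes[1:]))
    else:
        # Inference only needs the largest single activation at any moment
        floats = batch_size * max(layer_sizes)
    return floats * bytes_per_float

sizes = [784, 512, 256, 10]
print(activation_memory_bytes(sizes, batch_size=256, training=True))   # 2396160
print(activation_memory_bytes(sizes, batch_size=256, training=False))  # 802816
```

For this small network the training cache is only a few megabytes, but the same $B \sum_l n_l$ scaling is what exhausts GPU memory on large models.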
GPU Considerations: GPUs reach peak throughput only on large matrix multiplications, so small layers or tiny batches leave most of the device idle. In practice, memory capacity rather than compute is often the binding constraint during training.
During training, storing all activations for backprop can exhaust GPU memory for large models or batch sizes. Techniques like gradient checkpointing trade compute for memory: discard activations during forward pass, recompute them during backward pass. This can reduce memory by O(√L) at the cost of one additional forward pass.
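A back-of-the-envelope sketch of that trade-off, assuming uniform layer sizes and checkpoints placed every ~√L layers (a simplification of real checkpointing schedules):

```python
import math

def checkpoint_tradeoff(n_layers, floats_per_layer):
    """Compare activation storage: full caching vs. sqrt(L) checkpointing."""
    full = n_layers * floats_per_layer
    # Keep ~sqrt(L) checkpoints; each segment (~sqrt(L) layers) is
    # recomputed on demand during the backward pass
    segments = math.isqrt(n_layers)
    checkpointed = (segments + n_layers // segments) * floats_per_layer
    return full, checkpointed

full, ckpt = checkpoint_tradeoff(n_layers=100, floats_per_layer=1_000_000)
print(f"full cache: {full:,} floats, checkpointed: {ckpt:,} floats")  # 5x less
```

For 100 layers this stores 20M instead of 100M floats, the promised $O(\sqrt{L})$ reduction, paid for with roughly one extra forward pass of compute.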
Floating-point arithmetic in forward propagation can lead to numerical issues if not handled carefully. Understanding these issues prevents silent failures.
Overflow and Underflow: in float32, $e^z$ overflows to infinity for $z \gtrsim 88$ (about $709$ in float64), while for very negative $z$ it underflows to exactly zero. Both occur routinely inside naive sigmoid and softmax implementations.
Sigmoid Stability:
Naive: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Problem: For $z \ll 0$, $e^{-z}$ overflows
Stable version: $$\sigma(z) = \begin{cases} \frac{1}{1 + e^{-z}} & z \geq 0 \\ \frac{e^z}{1 + e^z} & z < 0 \end{cases}$$
Softmax Stability:
Naive: $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
Problem: For large $z_i$, $e^{z_i}$ overflows
Solution: Subtract maximum before exp: $$\text{softmax}(z)_i = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}$$
This is mathematically equivalent but numerically stable.
Log-Softmax:
When computing cross-entropy loss, we often need $\log(\text{softmax}(z))$. Computing softmax then log loses precision. Use log-sum-exp trick:
$$\log \text{softmax}(z)_i = z_i - \log \sum_j e^{z_j}$$
where $\log \sum_j e^{z_j}$ uses the stable log-sum-exp:
$$\log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}$$
with $m = \max_j z_j$.
```python
import numpy as np


def sigmoid_naive(z):
    """Naive sigmoid - can overflow for negative z."""
    return 1 / (1 + np.exp(-z))


def sigmoid_stable(z):
    """Numerically stable sigmoid."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )


def softmax_naive(z):
    """Naive softmax - overflows for large z."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


def softmax_stable(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


def log_softmax_stable(z):
    """
    Numerically stable log-softmax.
    More accurate than log(softmax(z)) for cross-entropy.
    """
    max_z = np.max(z, axis=-1, keepdims=True)
    log_sum_exp = max_z + np.log(np.sum(np.exp(z - max_z), axis=-1, keepdims=True))
    return z - log_sum_exp


def cross_entropy_stable(z, targets):
    """
    Stable cross-entropy from logits.

    Args:
        z: Logits (pre-softmax), shape (batch, classes)
        targets: Class indices, shape (batch,)

    Returns:
        Cross-entropy loss per sample
    """
    log_probs = log_softmax_stable(z)
    # Select log probability of correct class
    return -log_probs[np.arange(len(targets)), targets]


# Demonstrate the difference
def demonstrate_numerical_stability():
    print("Numerical Stability Demonstration")
    print("=" * 50)

    # Test sigmoid with extreme values
    z_extreme = np.array([-1000, -100, 0, 100, 1000])
    print("Sigmoid with extreme values:")
    print(f"Input z: {z_extreme}")

    # Naive (will have issues)
    with np.errstate(over='ignore', invalid='ignore'):
        naive_result = sigmoid_naive(z_extreme)
    print(f"Naive sigmoid: {naive_result}")

    # Stable
    stable_result = sigmoid_stable(z_extreme)
    print(f"Stable sigmoid: {stable_result}")

    # Test softmax with large values
    print("Softmax with large values:")
    z_large = np.array([1000, 1001, 1002])
    print(f"Input z: {z_large}")

    with np.errstate(over='ignore', invalid='ignore'):
        naive_softmax = softmax_naive(z_large)
    print(f"Naive softmax: {naive_softmax}")

    stable_softmax = softmax_stable(z_large)
    print(f"Stable softmax: {stable_softmax}")

    # Log-softmax precision
    print("Log-softmax precision:")
    z = np.array([[0.1, 0.2, 100.0]])  # Very confident prediction

    # Naive: log(softmax(z))
    with np.errstate(divide='ignore'):
        naive_log = np.log(softmax_stable(z))
    print(f"Naive log(softmax): {naive_log[0]}")

    # Stable log-softmax
    stable_log = log_softmax_stable(z)
    print(f"Stable log_softmax: {stable_log[0]}")
    print("(Stable version preserves precision in the low-probability values)")


if __name__ == "__main__":
    demonstrate_numerical_stability()
```

Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement numerically stable versions of these functions. PyTorch's nn.CrossEntropyLoss combines log-softmax and NLL loss in a single stable operation. Always use framework-provided functions rather than manually composing exp, log, and division.
Processing multiple samples simultaneously (batch processing) is essential for efficient training and inference. The mathematics extends naturally from single samples.
Notation Convention:
For a batch of $B$ samples: stack the samples as rows, so $A^{(0)} = X \in \mathbb{R}^{B \times n_0}$, and each layer produces matrices $Z^{(l)}, A^{(l)} \in \mathbb{R}^{B \times n_l}$ whose $i$-th rows correspond to sample $i$.
Batch Forward Pass:
$$Z^{(l)} = A^{(l-1)} W^{(l)\top} + \mathbf{1}_B \mathbf{b}^{(l)\top}$$
$$A^{(l)} = \sigma^{(l)}(Z^{(l)})$$
where $\mathbf{1}_B$ is a column vector of ones (for broadcasting the bias).
In practice (with broadcasting): $$Z^{(l)} = A^{(l-1)} W^{(l)\top} + \mathbf{b}^{(l)}$$
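The two formulations are numerically identical, as a quick NumPy check confirms (dimensions chosen arbitrarily for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_in, n_out = 4, 5, 3
A = rng.normal(size=(B, n_in))      # A^(l-1): one sample per row
W = rng.normal(size=(n_out, n_in))  # W^(l)
b = rng.normal(size=n_out)          # b^(l)

# Explicit form: the outer product 1_B b^T replicates the bias for every row
ones_B = np.ones((B, 1))
Z_explicit = A @ W.T + ones_B @ b.reshape(1, -1)

# Broadcasting form: NumPy adds b to each row automatically
Z_broadcast = A @ W.T + b

assert np.allclose(Z_explicit, Z_broadcast)
```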
Why Batch Processing?
Hardware Efficiency: matrix-matrix products reuse each weight across all $B$ samples, amortizing memory access and keeping the parallel units of CPUs and GPUs busy; one pass over a batch of 128 is far faster than 128 single-sample passes.

Gradient Statistics: during training, averaging gradients over a batch reduces their variance, producing smoother and more reliable update directions than single-sample stochastic steps.

Inference Throughput: serving systems batch concurrent requests, raising total predictions per second at the cost of slightly higher per-request latency.
```python
import numpy as np
import time
from typing import List, Tuple


def batch_forward(
    X: np.ndarray,  # Shape: (batch_size, n_input)
    weights: List[np.ndarray],
    biases: List[np.ndarray],
    activations: List[callable]
) -> Tuple[np.ndarray, List[np.ndarray]]:
    """
    Efficient batch forward propagation.

    Uses matrix-matrix multiplication for all samples simultaneously.

    Args:
        X: Input batch, shape (batch_size, n_features)
        weights: List of weight matrices
        biases: List of bias vectors
        activations: List of activation functions

    Returns:
        output: Predictions, shape (batch_size, n_output)
        all_activations: List of activation matrices for each layer
    """
    A = X  # Current activation matrix
    all_activations = [X]

    for W, b, sigma in zip(weights, biases, activations):
        # Z = A @ W.T + b
        # With broadcasting, b of shape (n_out,) is added to each row
        Z = A @ W.T + b
        A = sigma(Z)
        all_activations.append(A)

    return A, all_activations


def benchmark_batch_sizes():
    """
    Compare computation time for different batch sizes.
    Demonstrates the efficiency gains from batching.
    """
    np.random.seed(42)

    # Network: 784 → 512 → 256 → 128 → 64 → 10
    sizes = [784, 512, 256, 128, 64, 10]
    weights = [np.random.randn(sizes[i+1], sizes[i]) * 0.01
               for i in range(len(sizes)-1)]
    biases = [np.zeros(sizes[i+1]) for i in range(len(sizes)-1)]
    activations = [lambda z: np.maximum(0, z)] * 4 + [lambda z: z]  # ReLU + Linear

    print("Batch Size Benchmark")
    print("=" * 60)
    print(f"Network: {' → '.join(map(str, sizes))}")
    print("-" * 60)

    # Test different batch sizes
    n_samples = 10000
    batch_sizes = [1, 8, 32, 128, 512, 2048, n_samples]

    for batch_size in batch_sizes:
        n_batches = n_samples // batch_size

        # Generate all data
        X_all = np.random.randn(n_samples, 784).astype(np.float32)

        # Time the processing
        start = time.time()
        for i in range(n_batches):
            X_batch = X_all[i*batch_size:(i+1)*batch_size]
            output, _ = batch_forward(X_batch, weights, biases, activations)
        elapsed = time.time() - start

        samples_per_sec = n_samples / elapsed
        print(f"Batch size {batch_size:5d}: {elapsed:.3f}s | "
              f"{samples_per_sec:,.0f} samples/sec | "
              f"{n_batches} batches")

    print("-" * 60)
    print("Larger batches = better hardware utilization = higher throughput")


def verify_batch_equivalence():
    """
    Verify that batch processing gives same results as
    processing samples individually.
    """
    np.random.seed(42)

    # Simple network
    sizes = [10, 20, 5]
    weights = [np.random.randn(sizes[i+1], sizes[i]) * 0.1
               for i in range(len(sizes)-1)]
    biases = [np.zeros(sizes[i+1]) for i in range(len(sizes)-1)]
    activations = [lambda z: np.maximum(0, z), lambda z: z]

    # Batch of 5 samples
    X = np.random.randn(5, 10)

    # Process as batch
    batch_output, _ = batch_forward(X, weights, biases, activations)

    # Process individually
    individual_outputs = []
    for i in range(5):
        x = X[i:i+1]  # Keep 2D
        out, _ = batch_forward(x, weights, biases, activations)
        individual_outputs.append(out[0])
    individual_output = np.array(individual_outputs)

    # Compare
    max_diff = np.max(np.abs(batch_output - individual_output))
    print(f"Max difference: {max_diff}")
    print(f"Batch and individual processing are "
          f"{'equivalent' if max_diff < 1e-10 else 'NOT equivalent'}!")


if __name__ == "__main__":
    verify_batch_equivalence()
    print()
    benchmark_batch_sizes()
```

The optimal batch size balances multiple factors: (1) Memory limits (larger batch = more activation memory); (2) Hardware efficiency (larger is better, up to a point); (3) Gradient quality (too large → fewer updates per epoch); (4) Generalization (some evidence that smaller batches generalize better). Start with 32–256 for most problems, then tune based on memory and learning curves.
Modern deep learning frameworks automate forward propagation and, critically, compute gradients automatically via automatic differentiation. Understanding this abstraction helps you use the frameworks effectively.
PyTorch Paradigm:
PyTorch uses dynamic computational graphs: the graph is built on the fly as forward() executes, so ordinary Python control flow (loops, conditionals) can shape the computation, and the graph is reconstructed on every call.
The forward() Method:
Custom models define forward() to specify computation:
```python
class MLP(nn.Module):
    def __init__(self, ...):
        # Define layers as attributes
        ...

    def forward(self, x):
        # Define forward computation
        # Return output
        ...
```
Calling model(x) invokes forward() and builds computation graph.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPWithHooks(nn.Module):
    """
    MLP with hooks for observing forward pass internals.

    Demonstrates:
    - Standard forward() implementation
    - Forward hooks for debugging/analysis
    - Intermediate activation extraction
    """

    def __init__(self, layer_sizes: list, activation: str = "relu"):
        super().__init__()
        self.layer_sizes = layer_sizes
        self.activation_name = activation

        # Create layers
        self.layers = nn.ModuleList()
        for i in range(len(layer_sizes) - 1):
            self.layers.append(nn.Linear(layer_sizes[i], layer_sizes[i+1]))

        # For storing activations during forward pass
        self.activations = {}
        self._register_hooks()

    def _get_activation_fn(self):
        """Return activation function."""
        if self.activation_name == "relu":
            return F.relu
        elif self.activation_name == "gelu":
            return F.gelu
        elif self.activation_name == "tanh":
            return torch.tanh
        else:
            raise ValueError(f"Unknown activation: {self.activation_name}")

    def _register_hooks(self):
        """Register forward hooks capturing each Linear layer's output
        (i.e., the pre-activations z^(l))."""
        def get_hook(name):
            def hook(module, input, output):
                self.activations[name] = output.detach()
            return hook

        for i, layer in enumerate(self.layers):
            layer.register_forward_hook(get_hook(f"layer_{i}"))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the network.

        Args:
            x: Input tensor of shape (batch_size, input_dim)

        Returns:
            Output tensor of shape (batch_size, output_dim)
        """
        activation_fn = self._get_activation_fn()

        # Forward through all layers except last
        for layer in self.layers[:-1]:
            x = activation_fn(layer(x))

        # Last layer without activation (raw logits)
        x = self.layers[-1](x)
        return x

    def forward_with_intermediates(self, x: torch.Tensor):
        """
        Forward pass returning the output plus all intermediate values.
        Useful for analysis and debugging.
        """
        intermediates = {"input": x.detach().clone()}
        activation_fn = self._get_activation_fn()

        for i, layer in enumerate(self.layers[:-1]):
            z = layer(x)            # Pre-activation
            x = activation_fn(z)    # Post-activation
            intermediates[f"layer_{i}_pre"] = z.detach()
            intermediates[f"layer_{i}_post"] = x.detach()

        # Final layer
        z = self.layers[-1](x)
        intermediates["output"] = z.detach()

        return z, intermediates


def demonstrate_forward_pass():
    """Demonstrate forward pass with analysis."""
    # Create model
    model = MLPWithHooks([784, 256, 128, 64, 10], activation="relu")

    # Random batch
    batch_size = 32
    x = torch.randn(batch_size, 784)

    # Forward pass
    output = model(x)

    print("Forward Pass Analysis")
    print("=" * 50)
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")

    # Note: the hooks capture Linear outputs, i.e. pre-activations
    print("Layer-wise statistics:")
    for name, activation in model.activations.items():
        mean = activation.mean().item()
        std = activation.std().item()
        sparsity = (activation == 0).float().mean().item()
        print(f"  {name}: mean={mean:.4f}, std={std:.4f}, sparsity={sparsity:.2%}")

    # Forward with intermediates
    output, intermediates = model.forward_with_intermediates(x)
    print("Intermediate tensor shapes:")
    for name, tensor in intermediates.items():
        print(f"  {name}: {tensor.shape}")

    # Verify gradient flow
    print("Gradient flow test:")
    loss = output.sum()
    loss.backward()
    for i, layer in enumerate(model.layers):
        grad_norm = layer.weight.grad.norm().item()
        print(f"  Layer {i} weight gradient norm: {grad_norm:.4f}")


if __name__ == "__main__":
    demonstrate_forward_pass()
```

When you call forward(), PyTorch builds a computational graph tracking every operation. Each tensor knows what operation created it and what inputs were used. When you call backward() on a loss, PyTorch traverses this graph in reverse, computing gradients via the chain rule. This automatic differentiation is what makes neural network training tractable.
Forward propagation is the fundamental computation that makes neural networks function. Every prediction, whether during training or inference, executes this algorithm.
What's Next:
We've covered how single samples and batches flow through the network. The next page presents the Matrix Formulation—expressing the entire forward pass as a sequence of matrix operations. This formalization is not just notational convenience; it's the key to efficient implementation on parallel hardware and the foundation for understanding batch normalization, attention mechanisms, and other modern components.
You now deeply understand forward propagation—the inference algorithm that transforms inputs into predictions. This understanding is essential for the next major topic: backpropagation, which reverses this computation to compute gradients for learning. The better you understand forward propagation, the easier backpropagation becomes.