With computational graphs established as our representational framework, we now turn to the mechanics of forward pass computation—the process by which input data flows through the graph to produce outputs. The forward pass is conceptually simple: evaluate nodes in an order that respects dependencies. Yet within this simplicity lies considerable depth.
Understanding the forward pass thoroughly is essential: it determines which values are available for the backward pass, how much memory training consumes, and how efficiently the hardware is utilized.
In this section, we'll develop a complete understanding of forward pass computation, from the mathematical foundations through practical implementation in modern frameworks.
By the end of this page, you'll understand: (1) The formal procedure for forward pass evaluation, (2) Why topological ordering is fundamental, (3) Memory allocation strategies during forward computation, (4) Lazy vs. eager evaluation trade-offs, (5) How intermediate values are cached for backward pass, and (6) Vectorization and batching in forward computation.
The forward pass is the evaluation of a computational graph from inputs to outputs. Given input values, we compute the value at every node until we reach the outputs (typically a loss value during training, or predictions during inference).
Formal Definition:
Given a computational graph $G = (V, E)$ with a set of input nodes $V_{\text{input}} \subset V$ and an operation $\phi_v$ attached to every non-input node $v$, the forward pass computes values $v_{\text{val}}$ for all $v \in V$ such that:
$$v_{\text{val}} = \begin{cases} \text{input value} & \text{if } v \in V_{\text{input}} \\ \phi_v(u_1^{\text{val}}, u_2^{\text{val}}, \ldots, u_k^{\text{val}}) & \text{otherwise, where } u_i \text{ are the parents of } v \end{cases}$$
The critical constraint is respecting dependencies: we can only compute $v_{\text{val}}$ after computing $u_i^{\text{val}}$ for all parents $u_i$ of $v$.
```python
def forward_pass(graph, input_values):
    """
    Execute forward pass on a computational graph.

    Args:
        graph: Computational graph with nodes and edges
        input_values: Dict mapping input node names to values

    Returns:
        Dict mapping all node names to computed values
    """
    values = {}

    # Step 1: Initialize input values
    for node_name, value in input_values.items():
        values[node_name] = value

    # Step 2: Get topological ordering
    topo_order = topological_sort(graph)

    # Step 3: Evaluate each node in order
    for node in topo_order:
        if node.name in values:
            # Already has value (input node)
            continue

        # Get parent values
        parent_values = [values[p.name] for p in node.parents]

        # Apply operation
        values[node.name] = node.operation(*parent_values)

        # Optional: Store value in node for backward pass
        node.cached_value = values[node.name]

    return values


def topological_sort(graph):
    """
    Return nodes in topological order (Kahn's algorithm).
    """
    in_degree = {node: len(node.parents) for node in graph.nodes}
    queue = [node for node in graph.nodes if in_degree[node] == 0]
    result = []

    while queue:
        node = queue.pop(0)
        result.append(node)
        for child in node.children:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    return result
```

Any topological ordering is valid—there may be many. What matters is the guarantee: when we compute a node, all its inputs are already available. Without this ordering, we'd face undefined values. The DAG property ensures at least one valid ordering exists.
Let's trace forward passes through increasingly complex examples to build intuition.
$$f(x, y) = (x + y) \cdot x$$
with $x = 3$, $y = 2$.
Graph structure: nodes A (input $x$) and B (input $y$), C = A + B, and D = C × A, with A feeding both C and D.
Forward pass trace:
| Step | Node | Operation | Parent Values | Result |
|---|---|---|---|---|
| 1 | A | input | — | 3 |
| 2 | B | input | — | 2 |
| 3 | C | add | A=3, B=2 | 5 |
| 4 | D | mul | C=5, A=3 | 15 |
Note that node A is used twice (fan-out). The forward pass simply uses its value wherever needed; the complexity of fan-out only manifests during the backward pass.
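For reference, the same trace written as straight-line Python (a minimal sketch; variable names mirror the node labels in the table):

```python
a = 3.0        # A: input x
b = 2.0        # B: input y
c = a + b      # C = A + B  -> 5.0
d = c * a      # D = C * A  -> 15.0  (A fans out: consumed by both C and D)
print(d)       # 15.0
```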
Consider a simple MLP with one hidden layer:
$$\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$$ $$\mathbf{y} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2$$
where $\sigma$ is the ReLU activation.
Let's trace with concrete values:
```python
import numpy as np

# Inputs
x = np.array([[1.0], [2.0]])           # Shape: (2, 1)

# Parameters (layer 1)
W1 = np.array([[0.5, -0.3],            # Shape: (3, 2)
               [0.2,  0.8],
               [-0.1, 0.4]])
b1 = np.array([[0.1], [0.0], [-0.2]])  # Shape: (3, 1)

# Parameters (layer 2)
W2 = np.array([[0.6, -0.2, 0.5]])      # Shape: (1, 3)
b2 = np.array([[0.1]])                 # Shape: (1, 1)

# Forward pass - step by step
print("=== Forward Pass Trace ===")

# Step 1: Linear transformation (layer 1)
z1 = W1 @ x + b1
print("z1 = W1 @ x + b1:")
print(f"  W1 @ x = {(W1 @ x).flatten()}")  # [-0.1, 1.8, 0.7]
print(f"  + b1   = {z1.flatten()}")        # [0.0, 1.8, 0.5]

# Step 2: ReLU activation
h = np.maximum(0, z1)
print(f"h = ReLU(z1) = {h.flatten()}")     # [0.0, 1.8, 0.5]

# Step 3: Linear transformation (layer 2)
z2 = W2 @ h + b2
print("z2 = W2 @ h + b2:")
print(f"  W2 @ h = {(W2 @ h).flatten()}")  # [-0.36 + 0.25] = [-0.11]
print(f"  + b2   = {z2.flatten()}")        # [-0.01]

print(f"Final output y = {z2.flatten()[0]:.4f}")

# The graph traced:
# x → [matmul with W1] → [add b1] → z1 → [ReLU] → h → [matmul with W2] → [add b2] → y
```

Shape tracking is essential: Each operation must receive correctly-shaped inputs. Matrix multiplication $(n \times k) \cdot (k \times m) \rightarrow (n \times m)$ has strict shape requirements.
Intermediate values must persist: After computing $\mathbf{z}_1$, we still need it for computing ReLU. After computing $\mathbf{h}$, we need it for layer 2. These values occupy memory throughout the forward pass.
Operations are atomically simple: Each step is a basic operation (matmul, add, elementwise max). The complexity emerges from composition.
Order is determined by dependencies: We couldn't compute $\mathbf{h}$ before $\mathbf{z}_1$, or $\mathbf{y}$ before $\mathbf{h}$. The data dependencies dictate order.
The forward pass creates numerous intermediate tensors, and managing memory efficiently is critical for training large models. Understanding memory patterns helps debug out-of-memory errors and optimize training.
During a forward pass, memory usage follows a characteristic pattern: each operation allocates its output tensor, and every activation the backward pass will need is retained rather than freed, so memory grows steadily and typically peaks at the end of the forward pass.
For a linear sequence of $n$ layers with activation size $a$, naive memory usage is $O(n \cdot a)$ because all intermediate activations must be retained for the backward pass.
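To make the $O(n \cdot a)$ estimate concrete, here is a rough calculator for the activations a naive forward pass retains (a sketch; the layer widths and batch size below are illustrative, and 4 bytes per element assumes float32):

```python
def activation_memory_mb(batch_size, layer_widths, bytes_per_element=4):
    """Rough memory to retain one activation tensor per layer for the backward pass."""
    elements = sum(batch_size * width for width in layer_widths)
    return elements * bytes_per_element / 1e6

# Hypothetical 8-layer MLP with 4096-wide hidden layers and a batch of 64:
print(activation_memory_mb(64, [4096] * 8))  # ≈ 8.4 MB of retained activations
```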
Not all forward pass values are needed for the backward pass—only those that appear in gradient computations. Consider the operations:
| Operation | Forward | Backward needs |
|---|---|---|
| $y = x + z$ | Sum values | Nothing (gradient just passes through) |
| $y = x \cdot z$ | Product values | Both $x$ and $z$ (product rule) |
| $y = \text{ReLU}(x)$ | Max with 0 | The mask $x > 0$ (not the actual values) |
| $y = \sigma(x)$ sigmoid | $\frac{1}{1+e^{-x}}$ | The output $y$ (since $\sigma' = \sigma(1-\sigma)$) |
| $y = Wx$ | Matrix multiply | Both $W$ and $x$ |
Training deep networks faces a fundamental tradeoff: store intermediate values (using memory) or recompute them during backward pass (using computation). Gradient checkpointing strategically recomputes some values to reduce peak memory, at the cost of ~33% more computation.
```python
import numpy as np


class MemoryEfficientNode:
    """Node that stores minimal information for backward pass."""

    def forward_and_cache(self, *inputs):
        """Compute forward value and cache ONLY what's needed for backward."""
        raise NotImplementedError


class ReLUNode(MemoryEfficientNode):
    def forward_and_cache(self, x):
        # Only cache the mask, not the full input
        self.cached_mask = x > 0  # Boolean tensor (1/8 size of float32)
        return np.maximum(0, x)

    def backward(self, grad_output):
        # Use cached mask
        return grad_output * self.cached_mask


class MatMulNode(MemoryEfficientNode):
    def forward_and_cache(self, x, W):
        # Must cache both operands for backward
        self.cached_x = x
        self.cached_W = W
        return W @ x

    def backward(self, grad_output):
        # ∂L/∂W = grad_output @ x^T
        # ∂L/∂x = W^T @ grad_output
        grad_W = grad_output @ self.cached_x.T
        grad_x = self.cached_W.T @ grad_output
        return grad_x, grad_W


class SigmoidNode(MemoryEfficientNode):
    def forward_and_cache(self, x):
        # Cache output (not input) - sufficient for sigmoid derivative
        self.cached_output = 1.0 / (1.0 + np.exp(-x))
        return self.cached_output

    def backward(self, grad_output):
        # σ'(x) = σ(x)(1 - σ(x)) = cached_output * (1 - cached_output)
        return grad_output * self.cached_output * (1 - self.cached_output)


# Memory analysis for a transformer layer (approximate)
# Batch=32, SeqLen=512, HiddenDim=768, Heads=12
#
# Self-attention activations to cache:
# - Q, K, V projections: 3 * 32 * 512 * 768 * 4 bytes = 144 MB
# - Attention weights:   32 * 12 * 512 * 512 * 4 bytes = 384 MB
# - Attention output:    32 * 512 * 768 * 4 bytes      = 48 MB
#
# Just one transformer layer: ~600-800 MB of cached activations
# A 12-layer model: ~7-10 GB just for activation caching!
```

1. In-place Operations: Some operations can modify tensors in-place, avoiding allocation:
```python
# Instead of: y = x + 1   (allocates a new tensor)
# Use:        x += 1      (modifies in-place)
```
However, in-place operations can break gradient computation if the original value is needed!
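To illustrate this failure mode, here is a small PyTorch sketch (our own example, not from the framework's docs): sigmoid saves its output for the backward pass, so mutating that output in place is expected to trigger a runtime error when backward() runs:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward reuses the output y
y += 1                 # in-place modification of a value needed for gradients

try:
    y.sum().backward()
except RuntimeError as e:
    # Typically: "one of the variables needed for gradient computation
    # has been modified by an inplace operation"
    print("Autograd error:", e)
```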
2. Memory Pooling: Frameworks often maintain memory pools, reusing allocations across iterations rather than repeatedly allocating and freeing.
3. Mixed Precision: Using float16 instead of float32 halves memory for activations. Modern GPUs have specialized hardware (Tensor Cores) that accelerate float16 computation.
4. Gradient Checkpointing: Strategically discard intermediate activations and recompute them during the backward pass, as in the sketch below.
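A minimal sketch of checkpointing using PyTorch's torch.utils.checkpoint utility (the two-block split and layer sizes here are illustrative assumptions): activations inside each checkpointed block are discarded after the forward pass and recomputed during backward:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                       nn.Linear(1024, 1024), nn.ReLU())
block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                       nn.Linear(1024, 10))

x = torch.randn(32, 1024, requires_grad=True)

# Only the block boundaries are stored; inner activations are recomputed in backward.
h = checkpoint(block1, x, use_reentrant=False)
y = checkpoint(block2, h, use_reentrant=False)
y.sum().backward()
```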
A critical performance optimization in forward passes is batching—processing multiple examples simultaneously. Modern hardware (GPUs, TPUs) is designed for parallel computation, and batching exploits this parallelism.
Consider a matrix-vector multiplication for a single example: $$\mathbf{y} = \mathbf{W} \mathbf{x}, \quad \mathbf{W} \in \mathbb{R}^{m \times n}, \mathbf{x} \in \mathbb{R}^{n}$$
Computing this once: $O(mn)$ operations, $O(mn)$ memory accesses.
Now consider batching $B$ examples: $$\mathbf{Y} = \mathbf{W} \mathbf{X}, \quad \mathbf{X} \in \mathbb{R}^{n \times B}, \mathbf{Y} \in \mathbb{R}^{m \times B}$$
Computing this: $O(mnB)$ operations, but memory access for $\mathbf{W}$ is amortized across $B$ examples!
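As a rough back-of-the-envelope sketch of this amortization (assuming float32 weights and counting one multiply-add as two FLOPs), arithmetic intensity grows linearly with batch size:

```python
def arithmetic_intensity(m, n, batch):
    """FLOPs per byte of weight traffic for Y = W @ X with W in R^{m x n}."""
    flops = 2 * m * n * batch   # one multiply-add per (output, input, example)
    weight_bytes = 4 * m * n    # W is read once regardless of batch size
    return flops / weight_bytes

for b in (1, 16, 64, 256):
    print(b, arithmetic_intensity(1024, 1024, b))   # 0.5, 8.0, 32.0, 128.0
```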
| Batch Size | Throughput (examples/sec) | GPU Utilization | Memory Usage |
|---|---|---|---|
| 1 | 100 | 5% | Low |
| 16 | 1,200 | 40% | Moderate |
| 64 | 3,500 | 75% | High |
| 256 | 5,000 | 90% | Very High |
| 1024 | 5,200 | 95% | Maximum |
By convention, tensors in deep learning have specific dimension orderings:
| Tensor Type | Convention | Example Shape |
|---|---|---|
| Dense features | (Batch, Features) | (32, 784) |
| Images (PyTorch) | (Batch, Channels, Height, Width) | (32, 3, 224, 224) |
| Images (TensorFlow) | (Batch, Height, Width, Channels) | (32, 224, 224, 3) |
| Sequences | (Batch, SeqLen, Features) | (32, 128, 512) |
| Sequences (packed) | (Batch, SeqLen, Features) or (SeqLen, Batch, Features) | varies |
The batch dimension is typically the first or second dimension, enabling efficient memory layout for parallel processing.
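For example, converting between the two image layouts is just an axis permutation (a small sketch with an illustrative batch of zeros):

```python
import numpy as np

# Hypothetical batch of 32 RGB images in PyTorch-style NCHW layout.
images_nchw = np.zeros((32, 3, 224, 224), dtype=np.float32)

# Convert to TensorFlow-style NHWC by permuting axes.
images_nhwc = np.transpose(images_nchw, (0, 2, 3, 1))
print(images_nhwc.shape)  # (32, 224, 224, 3)
```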
```python
import numpy as np
import time


def unbatched_forward(X, W1, b1, W2, b2):
    """Process examples one at a time."""
    results = []
    for x in X:
        x = x.reshape(-1, 1)             # Column vector
        h = np.maximum(0, W1 @ x + b1)
        y = W2 @ h + b2
        results.append(y.flatten())
    return np.array(results)


def batched_forward(X, W1, b1, W2, b2):
    """Process all examples simultaneously."""
    # X shape: (batch, features) -> (features, batch) for matmul
    X_T = X.T
    h = np.maximum(0, W1 @ X_T + b1)
    y = W2 @ h + b2
    return y.T                           # Back to (batch, outputs)


# Benchmark
np.random.seed(42)
batch_size = 1000
input_dim, hidden_dim, output_dim = 784, 256, 10

X = np.random.randn(batch_size, input_dim).astype(np.float32)
W1 = np.random.randn(hidden_dim, input_dim).astype(np.float32) * 0.01
b1 = np.zeros((hidden_dim, 1), dtype=np.float32)
W2 = np.random.randn(output_dim, hidden_dim).astype(np.float32) * 0.01
b2 = np.zeros((output_dim, 1), dtype=np.float32)

# Time unbatched
start = time.time()
y_unbatched = unbatched_forward(X, W1, b1, W2, b2)
unbatched_time = time.time() - start

# Time batched
start = time.time()
y_batched = batched_forward(X, W1, b1, W2, b2)
batched_time = time.time() - start

print(f"Unbatched: {unbatched_time*1000:.2f} ms")
print(f"Batched:   {batched_time*1000:.2f} ms")
print(f"Speedup:   {unbatched_time/batched_time:.1f}x")
print(f"Results match: {np.allclose(y_unbatched, y_batched)}")

# Typical output (CPU):
# Unbatched: 125.32 ms
# Batched:   3.45 ms
# Speedup:   36.3x
```

Modern forward passes exploit multiple levels of parallelism: (1) Batch parallelism — different examples processed simultaneously, (2) Operation parallelism — independent operations computed concurrently, (3) SIMD parallelism — single instruction applied to multiple data elements, (4) GPU thread parallelism — thousands of threads execute concurrently. Effective batching enables all of these.
Computational graph frameworks differ fundamentally in when they execute operations. This distinction—eager versus lazy (deferred) evaluation—profoundly affects usability, debugging, and optimization potential.
In eager mode, operations execute immediately when called. Each line of code produces an actual computed value.
```python
# Eager mode (PyTorch, TensorFlow eager)
x = torch.tensor([1.0, 2.0])
y = x * 2   # Computation happens HERE
print(y)    # tensor([2., 4.]) - actual values available
```
Advantages: results are available immediately, so debugging with print statements and standard debuggers just works; ordinary Python control flow (loops, conditionals) can be used directly; and intermediate values can be inspected at any point.

Disadvantages: the framework only ever sees one operation at a time, so it cannot apply whole-graph optimizations such as operator fusion, and per-operation dispatch overhead can dominate for small tensors.
In lazy mode, operations build a symbolic graph without executing. Computation occurs only when explicitly requested (e.g., session.run() in TF 1.x, or when a lazy tensor is materialized).
```python
# Lazy mode (TensorFlow 1.x style)
x = tf.placeholder(tf.float32, shape=[2])
y = x * 2   # No computation - just graph building
# y is a symbolic tensor, not actual values

with tf.Session() as sess:
    result = sess.run(y, feed_dict={x: [1.0, 2.0]})  # NOW it computes
    print(result)  # [2., 4.]
```
Advantages: the framework sees the whole graph before running it, enabling global optimizations (operator fusion, memory planning, parallel scheduling), and the graph can be serialized and deployed independently of the Python code that built it.

Disadvantages: debugging is harder because values don't exist until the graph executes, and control flow requires special graph constructs (tf.cond, tf.while_loop).

Modern frameworks blend both approaches:
TensorFlow 2 + tf.function:
Default is eager, but decorating a function with @tf.function traces it into a graph:
```python
@tf.function  # Traces to graph for optimization
def forward(x, W):
    return tf.nn.relu(tf.matmul(x, W))

y = forward(data, weights)  # First call: trace graph. Subsequent: fast execution
```
JAX:
Default is eager, but jax.jit compiles functions:
```python
@jax.jit  # JIT compile with XLA
def forward(x, W):
    return jax.nn.relu(x @ W)
```
PyTorch + TorchScript:
```python
scripted_model = torch.jit.script(model)  # Convert to TorchScript graph
```
This hybrid approach gives developers eager mode for development and debugging, with lazy/compiled mode for production performance.
Forward pass computations operate in floating-point arithmetic, which introduces numerical considerations that can affect both correctness and performance.
Deep learning frameworks support multiple precision levels:
| Precision | Bits | Range | Significant Digits | Use Case |
|---|---|---|---|---|
| float64 (double) | 64 | ±10³⁰⁸ | 15-16 | Scientific computing, rarely in DL |
| float32 (single) | 32 | ±10³⁸ | 6-7 | Default for most DL training |
| float16 (half) | 16 | ±65504 | 3-4 | GPU-accelerated training with loss scaling |
| bfloat16 | 16 | ±10³⁸ | 2-3 | TPUs, broader range than float16 |
| int8 | 8 | -128 to 127 | N/A | Inference quantization |
The choice of precision involves tradeoffs between accuracy, memory, and computation speed.
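A quick NumPy illustration of these range and precision differences (a sketch; printed values are approximate):

```python
import numpy as np

# Dynamic range and precision of common dtypes.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__, "max:", info.max, "eps:", info.eps)

# float16 loses small contributions that float32 keeps (~3-4 vs 6-7 digits).
print(np.float32(1.0) + np.float32(1e-4))   # ≈ 1.0001
print(np.float16(1.0) + np.float16(1e-4))   # 1.0  (1e-4 is below float16's resolution near 1)
```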
Certain operations are numerically unstable in naive implementations: (1) Softmax with large logits → overflow, (2) Log of small probabilities → underflow, (3) Division by values near zero → explosion, (4) Exponentials of large values → overflow. Frameworks often provide numerically stable compound operations (log_softmax, binary_cross_entropy_with_logits) that avoid these issues.
```python
import numpy as np


def softmax_naive(logits):
    """Naive softmax - numerically unstable!"""
    exp_logits = np.exp(logits)  # Can overflow for large logits
    return exp_logits / np.sum(exp_logits)


def softmax_stable(logits):
    """Stable softmax using the max-subtraction (log-sum-exp) trick."""
    # Subtract max for numerical stability
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)  # Now safe: max value is exp(0) = 1
    return exp_shifted / np.sum(exp_shifted)


def log_softmax_stable(logits):
    """Stable log-softmax - avoids log of small numbers."""
    max_logit = np.max(logits)
    # log(softmax(x)) = x - max(x) - log(sum(exp(x - max(x))))
    shifted = logits - max_logit
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return shifted - log_sum_exp


# Demonstration
large_logits = np.array([1000.0, 1001.0, 1002.0])
print("Testing with large logits:", large_logits)

try:
    with np.errstate(over='raise'):  # turn the overflow warning into an exception
        naive_result = softmax_naive(large_logits)
    print(f"Naive softmax: {naive_result}")
except FloatingPointError:
    print("Naive softmax: overflow occurred")

stable_result = softmax_stable(large_logits)
print(f"Stable softmax: {stable_result}")

log_sm = log_softmax_stable(large_logits)
print(f"Log-softmax: {log_sm}")

# Cross-entropy from log-softmax (stable)
target = 2  # True class
loss = -log_sm[target]
print(f"Cross-entropy loss: {loss}")

# Output:
# Testing with large logits: [1000. 1001. 1002.]
# Naive softmax: overflow occurred
# Stable softmax: [0.09003057 0.24472847 0.66524096]
# Log-softmax: [-2.40760596 -1.40760596 -0.40760596]
# Cross-entropy loss: 0.407605958...
```

Modern training often uses mixed precision: most operations in float16 for speed, with critical operations in float32 for stability.
The technique: keep a float32 master copy of the weights, run the forward and backward passes in float16 (or bfloat16), and perform the weight update in float32.
Loss scaling: To prevent gradient underflow in float16, multiply loss by a scale factor before backward pass, then divide gradients by the same factor before update.
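A small numerical sketch of why the scale factor matters (the factor 1024 is illustrative):

```python
import numpy as np

grad = 1e-8                                   # a plausibly tiny gradient value
print(np.float16(grad))                       # 0.0 -- underflows: below float16's smallest subnormal
print(np.float16(grad * 1024))                # ≈ 1e-05 -- survives once the loss (and gradients) are scaled
print(float(np.float16(grad * 1024)) / 1024)  # ≈ 1e-08 -- unscale in higher precision before the update
```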
```python
# Mixed precision training (conceptual)
with autocast():  # Context for automatic mixed precision
    outputs = model(inputs)   # float16 forward
    loss = criterion(outputs, targets)

scaler.scale(loss).backward()  # Scaled float16 backward
scaler.step(optimizer)         # Unscale + update
scaler.update()                # Adjust scale factor
```
Mixed precision can provide 2-3x speedup on modern GPUs with minimal accuracy loss.
Let's see how forward passes work in practice across major frameworks, understanding their idioms and optimizations.
PyTorch uses eager execution with dynamic graphs. The forward pass is accomplished by calling the model as a function:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Forward pass defined explicitly
        # x: (batch, 3, 32, 32)
        x = F.relu(self.conv1(x))  # (batch, 32, 32, 32)
        x = self.pool(x)           # (batch, 32, 16, 16)
        x = F.relu(self.conv2(x))  # (batch, 64, 16, 16)
        x = self.pool(x)           # (batch, 64, 8, 8)
        x = x.view(x.size(0), -1)  # Flatten: (batch, 64*8*8)
        x = F.relu(self.fc1(x))    # (batch, 256)
        x = self.dropout(x)
        x = self.fc2(x)            # (batch, num_classes)
        return x


# Using the model
model = ConvNet()
images = torch.randn(32, 3, 32, 32)  # Batch of 32 images

# Forward pass - graph built during this call
with torch.no_grad():  # No gradients needed for inference
    logits = model(images)
    print(f"Output shape: {logits.shape}")  # torch.Size([32, 10])

# For training, enable gradients
logits = model(images)  # Graph ready for backprop
loss = F.cross_entropy(logits, torch.randint(0, 10, (32,)))
loss.backward()  # Backward pass on the graph built during forward
```

TensorFlow 2 with Keras offers a similar experience with eager execution by default:
```python
import tensorflow as tf
from tensorflow import keras

# Define model using Keras Sequential API
model = keras.Sequential([
    keras.layers.Conv2D(32, 3, padding='same', activation='relu',
                        input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10)
])

# Forward pass (eager by default in TF 2)
images = tf.random.normal((32, 32, 32, 3))
logits = model(images, training=False)  # training=False disables dropout

print(f"Output shape: {logits.shape}")  # (32, 10)

# Compiled forward pass for performance
@tf.function  # JIT compile for faster execution
def optimized_forward(model, images):
    return model(images, training=False)

# First call traces the graph, subsequent calls use compiled version
logits_fast = optimized_forward(model, images)

# Training with GradientTape
with tf.GradientTape() as tape:
    logits = model(images, training=True)  # training=True enables dropout
    loss = keras.losses.sparse_categorical_crossentropy(
        tf.random.uniform((32,), 0, 10, dtype=tf.int32),
        logits,
        from_logits=True
    )
    loss = tf.reduce_mean(loss)

gradients = tape.gradient(loss, model.trainable_variables)
```

Many layers behave differently during training and inference. Dropout randomly zeros neurons during training but is disabled during inference. BatchNorm uses batch statistics during training but stored running statistics during inference. Always explicitly set the mode (training=True/False in TensorFlow, model.train()/model.eval() in PyTorch).
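To see the mode switch in isolation, here is a small PyTorch sketch using a bare Dropout layer (the tensor of ones is just a convenient probe):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the values are zeroed, survivors scaled by 2
print(drop(x))

drop.eval()     # evaluation mode: dropout becomes the identity
print(drop(x))  # all ones
```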
We've thoroughly explored forward pass computation—the first half of neural network training and the entirety of inference. The key insights: evaluation must respect a topological ordering of the graph, intermediate values are cached for the backward pass and dominate memory usage, batching is what makes modern hardware efficient, and eager versus lazy execution trades debuggability against optimization.
With forward pass computation understood, we're ready for the more sophisticated reverse-mode automatic differentiation (backpropagation).
The forward pass provides values; the backward pass provides gradients. Together, they enable training neural networks of any architecture—the remarkable capability that powers all of modern deep learning.
You now have a comprehensive understanding of forward pass computation in neural networks. You understand how values propagate through computational graphs, the importance of topological ordering, memory management challenges, batching for performance, and the distinction between eager and lazy evaluation. Next, we'll explore how gradients flow in the opposite direction during backpropagation.