With computational graphs established as our representational framework, we now turn to the mechanics of forward pass computation—the process by which input data flows through the graph to produce outputs. The forward pass is conceptually simple: evaluate nodes in an order that respects dependencies. Yet within this simplicity lies considerable depth.
Understanding the forward pass thoroughly is essential: it determines which values are available for the backward pass, how much memory training consumes, and how efficiently the hardware is utilized.
In this section, we'll develop a complete understanding of forward pass computation, from the mathematical foundations through practical implementation in modern frameworks.
By the end of this page, you'll understand: (1) The formal procedure for forward pass evaluation, (2) Why topological ordering is fundamental, (3) Memory allocation strategies during forward computation, (4) Lazy vs. eager evaluation trade-offs, (5) How intermediate values are cached for backward pass, and (6) Vectorization and batching in forward computation.
The forward pass is the evaluation of a computational graph from inputs to outputs. Given input values, we compute the value at every node until we reach the outputs (typically a loss value during training, or predictions during inference).
Formal Definition:
Given a computational graph $G = (V, E)$ with a set of input nodes $V_{\text{input}} \subset V$ and an operation $\phi_v$ attached to every non-input node $v$, the forward pass computes values $v_{\text{val}}$ for all $v \in V$ such that:
$$v_{\text{val}} = \begin{cases} \text{input value} & \text{if } v \in V_{\text{input}} \\ \phi_v(u_1^{\text{val}}, u_2^{\text{val}}, \ldots, u_k^{\text{val}}) & \text{otherwise, where } u_i \text{ are the parents of } v \end{cases}$$
The critical constraint is respecting dependencies: we can only compute $v_{\text{val}}$ after computing $u_i^{\text{val}}$ for all parents $u_i$ of $v$.
```python
def forward_pass(graph, input_values):
    """
    Execute forward pass on a computational graph.

    Args:
        graph: Computational graph with nodes and edges
        input_values: Dict mapping input node names to values

    Returns:
        Dict mapping all node names to computed values
    """
    values = {}

    # Step 1: Initialize input values
    for node_name, value in input_values.items():
        values[node_name] = value

    # Step 2: Get topological ordering
    topo_order = topological_sort(graph)

    # Step 3: Evaluate each node in order
    for node in topo_order:
        if node.name in values:
            # Already has value (input node)
            continue

        # Get parent values
        parent_values = [values[p.name] for p in node.parents]

        # Apply operation
        values[node.name] = node.operation(*parent_values)

        # Optional: Store value in node for backward pass
        node.cached_value = values[node.name]

    return values


def topological_sort(graph):
    """
    Return nodes in topological order (Kahn's algorithm).
    """
    in_degree = {node: len(node.parents) for node in graph.nodes}
    queue = [node for node in graph.nodes if in_degree[node] == 0]
    result = []

    while queue:
        node = queue.pop(0)
        result.append(node)
        for child in node.children:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    return result
```

Any topological ordering is valid—there may be many. What matters is the guarantee: when we compute a node, all its inputs are already available. Without this ordering, we'd face undefined values. The DAG property ensures at least one valid ordering exists.
Let's trace forward passes through increasingly complex examples to build intuition.
$$f(x, y) = (x + y) \cdot x$$
with $x = 3$, $y = 2$.
Graph structure: nodes A (input $x$) and B (input $y$), C = A + B, and D = C × A, with A feeding both C and D.
Forward pass trace:
| Step | Node | Operation | Parent Values | Result |
|---|---|---|---|---|
| 1 | A | input | — | 3 |
| 2 | B | input | — | 2 |
| 3 | C | add | A=3, B=2 | 5 |
| 4 | D | mul | C=5, A=3 | 15 |
Note that node A is used twice (fan-out). The forward pass simply uses its value wherever needed; the complexity of fan-out only manifests during the backward pass.
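For reference, the same trace written as straight-line Python (a minimal sketch; variable names mirror the node labels in the table):

```python
a = 3.0        # A: input x
b = 2.0        # B: input y
c = a + b      # C = A + B  -> 5.0
d = c * a      # D = C * A  -> 15.0  (A fans out: consumed by both C and D)
print(d)       # 15.0
```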
Consider a simple MLP with one hidden layer:
$$\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$$ $$\mathbf{y} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2$$
where $\sigma$ is the ReLU activation.
Let's trace with concrete values:
```python
import numpy as np

# Inputs
x = np.array([[1.0], [2.0]])           # Shape: (2, 1)

# Parameters (layer 1)
W1 = np.array([[0.5, -0.3],            # Shape: (3, 2)
               [0.2,  0.8],
               [-0.1, 0.4]])
b1 = np.array([[0.1], [0.0], [-0.2]])  # Shape: (3, 1)

# Parameters (layer 2)
W2 = np.array([[0.6, -0.2, 0.5]])      # Shape: (1, 3)
b2 = np.array([[0.1]])                 # Shape: (1, 1)

# Forward pass - step by step
print("=== Forward Pass Trace ===")

# Step 1: Linear transformation (layer 1)
z1 = W1 @ x + b1
print("z1 = W1 @ x + b1:")
print(f"  W1 @ x = {(W1 @ x).flatten()}")  # [-0.1, 1.8, 0.7]
print(f"  + b1   = {z1.flatten()}")        # [0.0, 1.8, 0.5]

# Step 2: ReLU activation
h = np.maximum(0, z1)
print(f"h = ReLU(z1) = {h.flatten()}")     # [0.0, 1.8, 0.5]

# Step 3: Linear transformation (layer 2)
z2 = W2 @ h + b2
print("z2 = W2 @ h + b2:")
print(f"  W2 @ h = {(W2 @ h).flatten()}")  # [-0.36 + 0.25] = [-0.11]
print(f"  + b2   = {z2.flatten()}")        # [-0.01]

print(f"Final output y = {z2.flatten()[0]:.4f}")

# The graph traced:
# x → [matmul with W1] → [add b1] → z1 → [ReLU] → h → [matmul with W2] → [add b2] → y
```

Shape tracking is essential: Each operation must receive correctly-shaped inputs. Matrix multiplication $(n \times k) \cdot (k \times m) \rightarrow (n \times m)$ has strict shape requirements.
Intermediate values must persist: After computing $\mathbf{z}_1$, we still need it for computing ReLU. After computing $\mathbf{h}$, we need it for layer 2. These values occupy memory throughout the forward pass.
Operations are atomically simple: Each step is a basic operation (matmul, add, elementwise max). The complexity emerges from composition.
Order is determined by dependencies: We couldn't compute $\mathbf{h}$ before $\mathbf{z}_1$, or $\mathbf{y}$ before $\mathbf{h}$. The data dependencies dictate order.
The forward pass creates numerous intermediate tensors, and managing memory efficiently is critical for training large models. Understanding memory patterns helps debug out-of-memory errors and optimize training.
During a forward pass, memory usage follows a characteristic pattern: each operation allocates its output tensor, and every activation the backward pass will need is retained rather than freed, so memory grows steadily and typically peaks at the end of the forward pass.
For a linear sequence of $n$ layers with activation size $a$, naive memory usage is $O(n \cdot a)$ because all intermediate activations must be retained for the backward pass.
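To make the $O(n \cdot a)$ estimate concrete, here is a rough calculator for the activations a naive forward pass retains (a sketch; the layer widths and batch size below are illustrative, and 4 bytes per element assumes float32):

```python
def activation_memory_mb(batch_size, layer_widths, bytes_per_element=4):
    """Rough memory to retain one activation tensor per layer for the backward pass."""
    elements = sum(batch_size * width for width in layer_widths)
    return elements * bytes_per_element / 1e6

# Hypothetical 8-layer MLP with 4096-wide hidden layers and a batch of 64:
print(activation_memory_mb(64, [4096] * 8))  # ≈ 8.4 MB of retained activations
```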
Not all forward pass values are needed for the backward pass—only those that appear in gradient computations. Consider the operations:
| Operation | Forward | Backward needs |
|---|---|---|
| $y = x + z$ | Sum values | Nothing (gradient just passes through) |
| $y = x \cdot z$ | Product values | Both $x$ and $z$ (product rule) |
| $y = \text{ReLU}(x)$ | Max with 0 | The mask $x > 0$ (not the actual values) |
| $y = \sigma(x)$ sigmoid | $\frac{1}{1+e^{-x}}$ | The output $y$ (since $\sigma' = \sigma(1-\sigma)$) |
| $y = Wx$ | Matrix multiply | Both $W$ and $x$ |
Training deep networks faces a fundamental tradeoff: store intermediate values (using memory) or recompute them during backward pass (using computation). Gradient checkpointing strategically recomputes some values to reduce peak memory, at the cost of ~33% more computation.
```python
import numpy as np


class MemoryEfficientNode:
    """Node that stores minimal information for backward pass."""

    def forward_and_cache(self, *inputs):
        """Compute forward value and cache ONLY what's needed for backward."""
        raise NotImplementedError


class ReLUNode(MemoryEfficientNode):
    def forward_and_cache(self, x):
        # Only cache the mask, not the full input
        self.cached_mask = x > 0  # Boolean tensor (1/8 size of float32)
        return np.maximum(0, x)

    def backward(self, grad_output):
        # Use cached mask
        return grad_output * self.cached_mask


class MatMulNode(MemoryEfficientNode):
    def forward_and_cache(self, x, W):
        # Must cache both operands for backward
        self.cached_x = x
        self.cached_W = W
        return W @ x

    def backward(self, grad_output):
        # ∂L/∂W = grad_output @ x^T
        # ∂L/∂x = W^T @ grad_output
        grad_W = grad_output @ self.cached_x.T
        grad_x = self.cached_W.T @ grad_output
        return grad_x, grad_W


class SigmoidNode(MemoryEfficientNode):
    def forward_and_cache(self, x):
        # Cache output (not input) - sufficient for sigmoid derivative
        self.cached_output = 1.0 / (1.0 + np.exp(-x))
        return self.cached_output

    def backward(self, grad_output):
        # σ'(x) = σ(x)(1 - σ(x)) = cached_output * (1 - cached_output)
        return grad_output * self.cached_output * (1 - self.cached_output)


# Memory analysis for a transformer layer (approximate)
# Batch=32, SeqLen=512, HiddenDim=768, Heads=12
#
# Self-attention activations to cache:
# - Q, K, V projections: 3 * 32 * 512 * 768 * 4 bytes = 144 MB
# - Attention weights:   32 * 12 * 512 * 512 * 4 bytes = 384 MB
# - Attention output:    32 * 512 * 768 * 4 bytes      = 48 MB
#
# Just one transformer layer: ~600-800 MB of cached activations
# A 12-layer model: ~7-10 GB just for activation caching!
```

1. In-place Operations: Some operations can modify tensors in-place, avoiding allocation:
```python
# Instead of: y = x + 1   (allocates a new tensor)
# Use:        x += 1      (modifies in-place)
```
However, in-place operations can break gradient computation if the original value is needed!
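To illustrate this failure mode, here is a small PyTorch sketch (our own example, not from the framework's docs): sigmoid saves its output for the backward pass, so mutating that output in place is expected to trigger a runtime error when backward() runs:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward reuses the output y
y += 1                 # in-place modification of a value needed for gradients

try:
    y.sum().backward()
except RuntimeError as e:
    # Typically: "one of the variables needed for gradient computation
    # has been modified by an inplace operation"
    print("Autograd error:", e)
```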
2. Memory Pooling: Frameworks often maintain memory pools, reusing allocations across iterations rather than repeatedly allocating and freeing.
3. Mixed Precision: Using float16 instead of float32 halves memory for activations. Modern GPUs have specialized hardware (Tensor Cores) that accelerate float16 computation.
4. Gradient Checkpointing: Strategically discard intermediate activations and recompute them during the backward pass, as in the sketch below.
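A minimal sketch of checkpointing using PyTorch's torch.utils.checkpoint utility (the two-block split and layer sizes here are illustrative assumptions): activations inside each checkpointed block are discarded after the forward pass and recomputed during backward:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                       nn.Linear(1024, 1024), nn.ReLU())
block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                       nn.Linear(1024, 10))

x = torch.randn(32, 1024, requires_grad=True)

# Only the block boundaries are stored; inner activations are recomputed in backward.
h = checkpoint(block1, x, use_reentrant=False)
y = checkpoint(block2, h, use_reentrant=False)
y.sum().backward()
```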
A critical performance optimization in forward passes is batching—processing multiple examples simultaneously. Modern hardware (GPUs, TPUs) is designed for parallel computation, and batching exploits this parallelism.
Consider a matrix-vector multiplication for a single example: $$\mathbf{y} = \mathbf{W} \mathbf{x}, \quad \mathbf{W} \in \mathbb{R}^{m \times n}, \mathbf{x} \in \mathbb{R}^{n}$$
Computing this once: $O(mn)$ operations, $O(mn)$ memory accesses.
Now consider batching $B$ examples: $$\mathbf{Y} = \mathbf{W} \mathbf{X}, \quad \mathbf{X} \in \mathbb{R}^{n \times B}, \mathbf{Y} \in \mathbb{R}^{m \times B}$$
Computing this: $O(mnB)$ operations, but memory access for $\mathbf{W}$ is amortized across $B$ examples!
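As a rough back-of-the-envelope sketch of this amortization (assuming float32 weights and counting one multiply-add as two FLOPs), arithmetic intensity grows linearly with batch size:

```python
def arithmetic_intensity(m, n, batch):
    """FLOPs per byte of weight traffic for Y = W @ X with W in R^{m x n}."""
    flops = 2 * m * n * batch   # one multiply-add per (output, input, example)
    weight_bytes = 4 * m * n    # W is read once regardless of batch size
    return flops / weight_bytes

for b in (1, 16, 64, 256):
    print(b, arithmetic_intensity(1024, 1024, b))   # 0.5, 8.0, 32.0, 128.0
```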
| Batch Size | Throughput (examples/sec) | GPU Utilization | Memory Usage |
|---|---|---|---|
| 1 | 100 | 5% | Low |
| 16 | 1,200 | 40% | Moderate |
| 64 | 3,500 | 75% | High |
| 256 | 5,000 | 90% | Very High |
| 1024 | 5,200 | 95% | Maximum |
By convention, tensors in deep learning have specific dimension orderings:
| Tensor Type | Convention | Example Shape |
|---|---|---|
| Dense features | (Batch, Features) | (32, 784) |
| Images (PyTorch) | (Batch, Channels, Height, Width) | (32, 3, 224, 224) |
| Images (TensorFlow) | (Batch, Height, Width, Channels) | (32, 224, 224, 3) |
| Sequences | (Batch, SeqLen, Features) | (32, 128, 512) |
| Sequences (packed) | (Batch, SeqLen, Features) or (SeqLen, Batch, Features) | varies |
The batch dimension is typically the first or second dimension, enabling efficient memory layout for parallel processing.
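For example, converting between the two image layouts is just an axis permutation (a small sketch with an illustrative batch of zeros):

```python
import numpy as np

# Hypothetical batch of 32 RGB images in PyTorch-style NCHW layout.
images_nchw = np.zeros((32, 3, 224, 224), dtype=np.float32)

# Convert to TensorFlow-style NHWC by permuting axes.
images_nhwc = np.transpose(images_nchw, (0, 2, 3, 1))
print(images_nhwc.shape)  # (32, 224, 224, 3)
```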
```python
import numpy as np
import time


def unbatched_forward(X, W1, b1, W2, b2):
    """Process examples one at a time."""
    results = []
    for x in X:
        x = x.reshape(-1, 1)             # Column vector
        h = np.maximum(0, W1 @ x + b1)
        y = W2 @ h + b2
        results.append(y.flatten())
    return np.array(results)


def batched_forward(X, W1, b1, W2, b2):
    """Process all examples simultaneously."""
    # X shape: (batch, features) -> (features, batch) for matmul
    X_T = X.T
    h = np.maximum(0, W1 @ X_T + b1)
    y = W2 @ h + b2
    return y.T                           # Back to (batch, outputs)


# Benchmark
np.random.seed(42)
batch_size = 1000
input_dim, hidden_dim, output_dim = 784, 256, 10

X = np.random.randn(batch_size, input_dim).astype(np.float32)
W1 = np.random.randn(hidden_dim, input_dim).astype(np.float32) * 0.01
b1 = np.zeros((hidden_dim, 1), dtype=np.float32)
W2 = np.random.randn(output_dim, hidden_dim).astype(np.float32) * 0.01
b2 = np.zeros((output_dim, 1), dtype=np.float32)

# Time unbatched
start = time.time()
y_unbatched = unbatched_forward(X, W1, b1, W2, b2)
unbatched_time = time.time() - start

# Time batched
start = time.time()
y_batched = batched_forward(X, W1, b1, W2, b2)
batched_time = time.time() - start

print(f"Unbatched: {unbatched_time*1000:.2f} ms")
print(f"Batched:   {batched_time*1000:.2f} ms")
print(f"Speedup:   {unbatched_time/batched_time:.1f}x")
print(f"Results match: {np.allclose(y_unbatched, y_batched)}")

# Typical output (CPU):
# Unbatched: 125.32 ms
# Batched:   3.45 ms
# Speedup:   36.3x
```

Modern forward passes exploit multiple levels of parallelism: (1) Batch parallelism — different examples processed simultaneously, (2) Operation parallelism — independent operations computed concurrently, (3) SIMD parallelism — single instruction applied to multiple data elements, (4) GPU thread parallelism — thousands of threads execute concurrently. Effective batching enables all of these.
Computational graph frameworks differ fundamentally in when they execute operations. This distinction—eager versus lazy (deferred) evaluation—profoundly affects usability, debugging, and optimization potential.
In eager mode, operations execute immediately when called. Each line of code produces an actual computed value.
```python
# Eager mode (PyTorch, TensorFlow eager)
x = torch.tensor([1.0, 2.0])
y = x * 2   # Computation happens HERE
print(y)    # tensor([2., 4.]) - actual values available
```
Advantages: results are available immediately, so debugging with print statements and standard debuggers just works; ordinary Python control flow (loops, conditionals) can be used directly; and intermediate values can be inspected at any point.

Disadvantages: the framework only ever sees one operation at a time, so it cannot apply whole-graph optimizations such as operator fusion, and per-operation dispatch overhead can dominate for small tensors.
In lazy mode, operations build a symbolic graph without executing. Computation occurs only when explicitly requested (e.g., session.run() in TF 1.x, or when a lazy tensor is materialized).
```python
# Lazy mode (TensorFlow 1.x style)
x = tf.placeholder(tf.float32, shape=[2])
y = x * 2   # No computation - just graph building
# y is a symbolic tensor, not actual values

with tf.Session() as sess:
    result = sess.run(y, feed_dict={x: [1.0, 2.0]})  # NOW it computes
    print(result)  # [2., 4.]
```
Advantages: the framework sees the whole graph before running it, enabling global optimizations (operator fusion, memory planning, parallel scheduling), and the graph can be serialized and deployed independently of the Python code that built it.

Disadvantages: debugging is harder because values don't exist until the graph executes, and control flow requires special graph constructs (tf.cond, tf.while_loop).

Modern frameworks blend both approaches:
TensorFlow 2 + tf.function:
Default is eager, but decorating a function with @tf.function traces it into a graph:
```python
@tf.function  # Traces to graph for optimization
def forward(x, W):
    return tf.nn.relu(tf.matmul(x, W))

y = forward(data, weights)  # First call: trace graph. Subsequent: fast execution
```
JAX:
Default is eager, but jax.jit compiles functions:
```python
@jax.jit  # JIT compile with XLA
def forward(x, W):
    return jax.nn.relu(x @ W)
```
PyTorch + TorchScript:
```python
scripted_model = torch.jit.script(model)  # Convert to TorchScript graph
```
This hybrid approach gives developers eager mode for development and debugging, with lazy/compiled mode for production performance.
Forward pass computations operate in floating-point arithmetic, which introduces numerical considerations that can affect both correctness and performance.
Deep learning frameworks support multiple precision levels:
| Precision | Bits | Range | Significant Digits | Use Case |
|---|---|---|---|---|
| float64 (double) | 64 | ±10³⁰⁸ | 15-16 | Scientific computing, rarely in DL |
| float32 (single) | 32 | ±10³⁸ | 6-7 | Default for most DL training |
| float16 (half) | 16 | ±65504 | 3-4 | GPU-accelerated training with loss scaling |
| bfloat16 | 16 | ±10³⁸ | 2-3 | TPUs, broader range than float16 |
| int8 | 8 | -128 to 127 | N/A | Inference quantization |
The choice of precision involves tradeoffs between accuracy, memory, and computation speed.
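A quick NumPy illustration of these range and precision differences (a sketch; printed values are approximate):

```python
import numpy as np

# Dynamic range and precision of common dtypes.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__, "max:", info.max, "eps:", info.eps)

# float16 loses small contributions that float32 keeps (~3-4 vs 6-7 digits).
print(np.float32(1.0) + np.float32(1e-4))   # ≈ 1.0001
print(np.float16(1.0) + np.float16(1e-4))   # 1.0  (1e-4 is below float16's resolution near 1)
```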
Certain operations are numerically unstable in naive implementations: (1) Softmax with large logits → overflow, (2) Log of small probabilities → underflow, (3) Division by values near zero → explosion, (4) Exponentials of large values → overflow. Frameworks often provide numerically stable compound operations (log_softmax, binary_cross_entropy_with_logits) that avoid these issues.
```python
import numpy as np


def softmax_naive(logits):
    """Naive softmax - numerically unstable!"""
    exp_logits = np.exp(logits)  # Can overflow for large logits
    return exp_logits / np.sum(exp_logits)


def softmax_stable(logits):
    """Stable softmax using the max-subtraction (log-sum-exp) trick."""
    # Subtract max for numerical stability
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)  # Now safe: max value is exp(0) = 1
    return exp_shifted / np.sum(exp_shifted)


def log_softmax_stable(logits):
    """Stable log-softmax - avoids log of small numbers."""
    max_logit = np.max(logits)
    # log(softmax(x)) = x - max(x) - log(sum(exp(x - max(x))))
    shifted = logits - max_logit
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return shifted - log_sum_exp


# Demonstration
large_logits = np.array([1000.0, 1001.0, 1002.0])
print("Testing with large logits:", large_logits)

try:
    with np.errstate(over='raise'):  # turn the overflow warning into an exception
        naive_result = softmax_naive(large_logits)
    print(f"Naive softmax: {naive_result}")
except FloatingPointError:
    print("Naive softmax: overflow occurred")

stable_result = softmax_stable(large_logits)
print(f"Stable softmax: {stable_result}")

log_sm = log_softmax_stable(large_logits)
print(f"Log-softmax: {log_sm}")

# Cross-entropy from log-softmax (stable)
target = 2  # True class
loss = -log_sm[target]
print(f"Cross-entropy loss: {loss}")

# Output:
# Testing with large logits: [1000. 1001. 1002.]
# Naive softmax: overflow occurred
# Stable softmax: [0.09003057 0.24472847 0.66524096]
# Log-softmax: [-2.40760596 -1.40760596 -0.40760596]
# Cross-entropy loss: 0.407605958...
```

Modern training often uses mixed precision: most operations in float16 for speed, with critical operations in float32 for stability.
The technique: keep a float32 master copy of the weights, run the forward and backward passes in float16 (or bfloat16), and perform the weight update in float32.
Loss scaling: To prevent gradient underflow in float16, multiply loss by a scale factor before backward pass, then divide gradients by the same factor before update.
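A small numerical sketch of why the scale factor matters (the factor 1024 is illustrative):

```python
import numpy as np

grad = 1e-8                                   # a plausibly tiny gradient value
print(np.float16(grad))                       # 0.0 -- underflows: below float16's smallest subnormal
print(np.float16(grad * 1024))                # ≈ 1e-05 -- survives once the loss (and gradients) are scaled
print(float(np.float16(grad * 1024)) / 1024)  # ≈ 1e-08 -- unscale in higher precision before the update
```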
```python
# Mixed precision training (conceptual)
with autocast():  # Context for automatic mixed precision
    outputs = model(inputs)   # float16 forward
    loss = criterion(outputs, targets)

scaler.scale(loss).backward()  # Scaled float16 backward
scaler.step(optimizer)         # Unscale + update
scaler.update()                # Adjust scale factor
```
Mixed precision can provide 2-3x speedup on modern GPUs with minimal accuracy loss.
Let's see how forward passes work in practice across major frameworks, understanding their idioms and optimizations.
PyTorch uses eager execution with dynamic graphs. The forward pass is accomplished by calling the model as a function:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Forward pass defined explicitly
        # x: (batch, 3, 32, 32)
        x = F.relu(self.conv1(x))  # (batch, 32, 32, 32)
        x = self.pool(x)           # (batch, 32, 16, 16)
        x = F.relu(self.conv2(x))  # (batch, 64, 16, 16)
        x = self.pool(x)           # (batch, 64, 8, 8)
        x = x.view(x.size(0), -1)  # Flatten: (batch, 64*8*8)
        x = F.relu(self.fc1(x))    # (batch, 256)
        x = self.dropout(x)
        x = self.fc2(x)            # (batch, num_classes)
        return x


# Using the model
model = ConvNet()
images = torch.randn(32, 3, 32, 32)  # Batch of 32 images

# Forward pass - graph built during this call
with torch.no_grad():  # No gradients needed for inference
    logits = model(images)
    print(f"Output shape: {logits.shape}")  # torch.Size([32, 10])

# For training, enable gradients
logits = model(images)  # Graph ready for backprop
loss = F.cross_entropy(logits, torch.randint(0, 10, (32,)))
loss.backward()  # Backward pass on the graph built during forward
```

TensorFlow 2 with Keras offers a similar experience with eager execution by default:
```python
import tensorflow as tf
from tensorflow import keras

# Define model using Keras Sequential API
model = keras.Sequential([
    keras.layers.Conv2D(32, 3, padding='same', activation='relu',
                        input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10)
])

# Forward pass (eager by default in TF 2)
images = tf.random.normal((32, 32, 32, 3))
logits = model(images, training=False)  # training=False disables dropout

print(f"Output shape: {logits.shape}")  # (32, 10)

# Compiled forward pass for performance
@tf.function  # JIT compile for faster execution
def optimized_forward(model, images):
    return model(images, training=False)

# First call traces the graph, subsequent calls use compiled version
logits_fast = optimized_forward(model, images)

# Training with GradientTape
with tf.GradientTape() as tape:
    logits = model(images, training=True)  # training=True enables dropout
    loss = keras.losses.sparse_categorical_crossentropy(
        tf.random.uniform((32,), 0, 10, dtype=tf.int32),
        logits,
        from_logits=True
    )
    loss = tf.reduce_mean(loss)

gradients = tape.gradient(loss, model.trainable_variables)
```

Many layers behave differently during training and inference. Dropout randomly zeros neurons during training but is disabled during inference. BatchNorm uses batch statistics during training but stored running statistics during inference. Always explicitly set the mode (training=True/False in TensorFlow, model.train()/model.eval() in PyTorch).
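To see the mode switch in isolation, here is a small PyTorch sketch using a bare Dropout layer (the tensor of ones is just a convenient probe):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the values are zeroed, survivors scaled by 2
print(drop(x))

drop.eval()     # evaluation mode: dropout becomes the identity
print(drop(x))  # all ones
```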
We've thoroughly explored forward pass computation—the first half of neural network training and the entirety of inference. The key insights: evaluation must respect a topological ordering of the graph, intermediate values are cached for the backward pass and dominate memory usage, batching is what makes modern hardware efficient, and eager versus lazy execution trades debuggability against optimization.
With forward pass computation understood, we're ready for the more sophisticated reverse-mode automatic differentiation (backpropagation).
The forward pass provides values; the backward pass provides gradients. Together, they enable training neural networks of any architecture—the remarkable capability that powers all of modern deep learning.
You now have a comprehensive understanding of forward pass computation in neural networks. You understand how values propagate through computational graphs, the importance of topological ordering, memory management challenges, batching for performance, and the distinction between eager and lazy evaluation. Next, we'll explore how gradients flow in the opposite direction during backpropagation.