Every time you train a neural network—whether a simple multilayer perceptron or a massive transformer with billions of parameters—a remarkable computational machinery operates behind the scenes. This machinery takes your network's computations, automatically derives gradients for every single parameter, and does so with stunning efficiency. The foundation of this machinery is a surprisingly elegant mathematical abstraction: the computational graph.
A computational graph transforms complex mathematical expressions into structured diagrams where nodes represent operations or values, and edges represent data dependencies. This transformation is far more than a visualization technique—it's the conceptual and algorithmic foundation that enables automatic differentiation, making it possible to train neural networks with millions or billions of parameters.
Without computational graphs, we would need to manually derive gradients for every architecture change, an impossibility for modern deep learning. With them, we write forward computations, and gradients appear automatically, correctly, and efficiently.
By the end of this page, you will understand: (1) How computational graphs represent mathematical expressions as structured diagrams, (2) The distinction between nodes, edges, and their semantics, (3) How complex functions decompose into elementary operations, (4) The mathematical foundation that enables automatic gradient computation, and (5) Why this representation is fundamental to all modern deep learning frameworks.
A computational graph is a directed graph that represents the structure of a mathematical computation. In this graph, nodes represent values or the operations that produce them, and directed edges record data dependencies: an edge from one node to another means the second cannot be computed until the first is available.
Consider a simple expression: $f(x, y) = (x + y) \cdot \sin(x)$
Instead of treating this as a monolithic function, we decompose it into elementary steps:

$$v_1 = x + y, \qquad v_2 = \sin(x), \qquad v_3 = v_1 \cdot v_2 = f(x, y)$$
This decomposition forms a computational graph where each step is a node, and edges connect operations that depend on previous results. The power of this representation is that each elementary operation has a known, simple derivative, and the chain rule allows us to compose these simple derivatives to obtain gradients of the entire function.
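To make the idea concrete, here is a minimal sketch in plain Python (no framework involved) that evaluates the decomposition step by step and composes the per-step derivatives by hand; the finite-difference check at the end is only for illustration.

```python
import math

def f_decomposed(x, y):
    # Elementary steps of f(x, y) = (x + y) * sin(x)
    v1 = x + y          # addition node
    v2 = math.sin(x)    # sine node
    v3 = v1 * v2        # multiplication node (the output)
    return v1, v2, v3

def f_gradient(x, y):
    # Compose the known derivatives of each elementary step (chain rule)
    v1, v2, _ = f_decomposed(x, y)
    df_dx = v2 * 1.0 + v1 * math.cos(x)   # x reaches the output via v1 and via v2
    df_dy = v2 * 1.0                      # y reaches the output only via v1
    return df_dx, df_dy

# Quick finite-difference check
x, y, eps = 1.3, 0.7, 1e-6
_, _, f0 = f_decomposed(x, y)
_, _, fx = f_decomposed(x + eps, y)
_, _, fy = f_decomposed(x, y + eps)
print(f_gradient(x, y))                  # analytic gradients
print((fx - f0) / eps, (fy - f0) / eps)  # numerical approximations
```

Note how $x$ feeds two different nodes ($v_1$ and $v_2$), so its derivative is the sum of the contributions from both paths, a point we return to later.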
Computational graphs transform the problem of differentiating complex functions into the much simpler problem of differentiating elementary operations and then systematically combining these derivatives using the chain rule. Any function expressible as compositions of differentiable elementary operations can be automatically differentiated.
Computational graphs can take two primary forms, depending on what nodes represent:
Expression Graphs (Value-Based): In this representation, each node represents a value (either an input or an intermediate result), and edges represent operations that produce new values from existing ones. This is the more intuitive representation and is commonly used for explanation.
Operation Graphs (Operation-Based): Here, nodes represent operations, and edges represent the flow of values between operations. This representation is closer to how many frameworks implement computational graphs internally.
Both representations are mathematically equivalent—they're just different ways of visualizing the same underlying computation. What matters is that both capture the essential structure: which outputs depend on which inputs, through which intermediate computations.
Let us formalize the computational graph concept with mathematical precision. This rigor is essential for understanding automatic differentiation algorithms.
Definition: A computational graph $G = (V, E)$ is a directed acyclic graph (DAG) where each node $v \in V$ is either an input (a leaf with no incoming edges) or the result of applying a primitive operation to the values of its predecessors, and each directed edge $(u, v) \in E$ indicates that the value of $u$ is an input to the computation at $v$.
The acyclic property is crucial: it guarantees that we can always evaluate the graph by processing nodes in a topological order, where each node's inputs are computed before the node itself.
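As a small illustration of topological ordering, the sketch below uses Python's standard-library `graphlib` to order the nodes of the example expression $f(x, y) = (x + y) \cdot \sin(x)$; the node names v1, v2, v3 are ours, not part of any framework.

```python
from graphlib import TopologicalSorter

# DAG for f(x, y) = (x + y) * sin(x), written as node -> set of predecessors
graph = {
    "v1": {"x", "y"},    # v1 = x + y
    "v2": {"x"},         # v2 = sin(x)
    "v3": {"v1", "v2"},  # v3 = v1 * v2  (the output)
}

# Any order returned here computes each node after all of its inputs
order = list(TopologicalSorter(graph).static_order())
print(order)  # e.g. ['x', 'y', 'v1', 'v2', 'v3']
```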
A computational graph framework defines a set of primitive operations that serve as building blocks. Every computation must ultimately decompose into these primitives. Typical primitive operations include:
| Category | Operations |
|---|---|
| Arithmetic | $+$, $-$, $\times$, $\div$, $x^n$ |
| Transcendental | $\exp$, $\log$, $\sin$, $\cos$, $\tanh$ |
| Comparison | $\max$, $\min$, $\lvert x \rvert$ |
| Linear Algebra | Matrix multiplication, transpose, inverse |
| Reduction | Sum, product, mean over axes |
| Selection | Indexing, slicing, concatenation |
For each primitive, the framework must know: (1) how to compute its output from its inputs (the forward function), and (2) how to compute the gradient of the loss with respect to each input, given the gradient with respect to the output (the backward, or VJP, function).
This is the contract that enables automatic differentiation: as long as every primitive satisfies this contract, any composition of primitives can be differentiated automatically.
The choice of primitives affects both expressiveness and efficiency. Too few primitives force convoluted implementations of common operations; too many create maintenance burden. Modern frameworks strike a balance with 100-300 carefully optimized primitives that cover nearly all practical needs efficiently.
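One way to picture this contract is as a pair of lookup tables per primitive, one for the forward computation and one for its VJP. The registry below is a hypothetical sketch for two primitives, not any framework's actual API:

```python
import numpy as np

# Hypothetical primitive registry: each entry pairs a forward function
# with a rule for pushing gradients back to its inputs.
FORWARD = {
    "exp": lambda x: np.exp(x),
    "add": lambda x, y: x + y,
}

VJP = {
    # Given dL/d(output), return dL/d(input) for each input
    "exp": lambda g, out, x: (g * out,),   # d exp(x)/dx = exp(x) = out
    "add": lambda g, out, x, y: (g, g),    # addition passes gradients through
}
```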
The power of computational graphs becomes evident when decomposing genuinely complex functions. Let's trace through a realistic example: the softmax cross-entropy loss, ubiquitous in classification tasks.
Given logits $\mathbf{z} = [z_1, z_2, ..., z_K]$ and true class $y$, the softmax cross-entropy loss is:
$$L = -\log\left(\frac{\exp(z_y)}{\sum_{k=1}^{K} \exp(z_k)}\right)$$
This can be rewritten as:
$$L = -z_y + \log\left(\sum_{k=1}^{K} \exp(z_k)\right)$$
Now let's decompose this into a computational graph:
```python
# Decomposition of Softmax Cross-Entropy Loss into Elementary Operations
# Input: logits z = [z_1, z_2, ..., z_K], true class y

# Step 1: Compute exp of each logit
e_1 = exp(z_1)
e_2 = exp(z_2)
# ...
e_K = exp(z_K)

# Step 2: Sum all exponentials
s = e_1 + e_2 + ... + e_K

# Step 3: Compute log of sum
log_s = log(s)

# Step 4: Select the logit for true class
z_y = select(z, y)  # z[y]

# Step 5: Negate the true class logit
neg_z_y = -z_y

# Step 6: Add to get final loss
L = neg_z_y + log_s

# The graph structure:
#   z_i → exp → e_i → (+) → s → log → log_s → (+) → L
#                                               ↑
#   z → select → z_y → neg → neg_z_y ───────────┘
```

Each operation in this decomposition—exp, log, +, -, select—has a well-defined derivative:
| Operation | Forward | Derivative w.r.t. input |
|---|---|---|
| $\exp(x)$ | $e^x$ | $e^x$ |
| $\log(x)$ | $\ln(x)$ | $1/x$ |
| $x + y$ | $x + y$ | $1$ for each input |
| $-x$ | $-x$ | $-1$ |
| $\text{select}(\mathbf{z}, y)$ | $z_y$ | $1$ at position $y$, $0$ elsewhere |
The magic is that once the graph is constructed, derivatives flow automatically. The gradient of $L$ with respect to any input $z_i$ is computed by tracing paths through the graph and multiplying derivatives along each path—this is precisely the chain rule.
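The sketch below (assuming NumPy, and standing apart from any framework) composes the per-operation derivatives from the table for this decomposition, arriving at the familiar softmax-minus-one-hot gradient, and checks it numerically:

```python
import numpy as np

def loss(z, y):
    # L = -z_y + log(sum_k exp(z_k)), following the decomposition above
    return -z[y] + np.log(np.sum(np.exp(z)))

def grad_loss(z, y):
    # Compose elementary derivatives along each path from z_i to L:
    #   via exp -> sum -> log:  dL/dz_i = (1/s) * exp(z_i) = softmax(z)_i
    #   via select -> negate (only for i == y): an extra -1
    g = np.exp(z) / np.sum(np.exp(z))   # softmax(z)
    g[y] -= 1.0
    return g

z = np.array([1.0, 2.0, 0.5])
y = 1
eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(3)[i], y) - loss(z, y)) / eps
                    for i in range(3)])
print(grad_loss(z, y))   # analytic: softmax(z) - one_hot(y)
print(numeric)           # agrees to roughly 1e-6
```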
Decomposition must also consider numerical stability. The naive softmax computation:
$$\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
can cause overflow when $z_i$ is large. The stable version uses the log-sum-exp trick:
$$\text{logsumexp}(\mathbf{z}) = z_{\max} + \log\left(\sum_j \exp(z_j - z_{\max})\right)$$
Computational graph frameworks often provide fused operations that combine multiple elementary operations with numerical stability, even though mathematically they're equivalent to the naive decomposition.
While mathematical equivalence holds in exact arithmetic, floating-point computation introduces significant considerations. Frameworks often implement 'compound' operations (like log_softmax_cross_entropy) as single primitives to ensure numerical stability, even though they could theoretically decompose into elementary operations.
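A quick NumPy illustration of why such fused primitives matter: the naive computation overflows for large logits, while the log-sum-exp form does not.

```python
import numpy as np

z = np.array([1000.0, 1000.0])

# Naive: exp(1000) overflows to inf, and the result is useless
naive = np.log(np.sum(np.exp(z)))                    # inf (with an overflow warning)

# Stable: subtract the max before exponentiating
z_max = np.max(z)
stable = z_max + np.log(np.sum(np.exp(z - z_max)))   # 1000 + log(2)

print(naive, stable)
```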
Neural networks are naturally expressed as computational graphs. Each layer, each activation, each loss computation becomes a subgraph within the larger network graph.
Consider a simple two-layer neural network with ReLU activation:
$$\hat{y} = \mathbf{W}_2 \cdot \text{ReLU}(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

$$L = \frac{1}{2}\|y - \hat{y}\|^2$$
This decomposes into a computational graph in which the input $\mathbf{x}$ and target $y$ enter as leaf nodes, the parameters $\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2$ are leaf nodes as well, and the operations $\text{matmul}_1$, $\text{add}_1$, ReLU, $\text{matmul}_2$, $\text{add}_2$, subtract, square, and mean form the interior nodes leading to the loss $L$.
To compute the gradient of the loss $L$ with respect to a parameter like $\mathbf{W}_1$, we must trace all paths from $L$ back to $\mathbf{W}_1$. In this simple network, there's exactly one path:
$$L \leftarrow \text{mean} \leftarrow \text{square} \leftarrow \text{subtract} \leftarrow \text{add}_2 \leftarrow \text{matmul}_2 \leftarrow \text{ReLU} \leftarrow \text{add}_1 \leftarrow \text{matmul}_1 \leftarrow \mathbf{W}_1$$
The gradient is the product of partial derivatives along this path. This is the chain rule at work:
$$\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial \text{mean}} \cdot \frac{\partial \text{mean}}{\partial \text{square}} \cdot ... \cdot \frac{\partial \text{matmul}_1}{\partial \mathbf{W}_1}$$
In deeper networks, the number of operations grows, but the principle remains identical: trace paths, multiply derivatives, accumulate contributions.
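As a sketch of this path in code (assuming NumPy and small arbitrary sizes, not tied to any framework), the snippet below runs the forward decomposition and multiplies the per-step derivatives by hand to obtain $\partial L / \partial \mathbf{W}_1$, then spot-checks one entry with a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, o = 4, 3, 2                      # input, hidden, output sizes (arbitrary)
x, y = rng.normal(size=d), rng.normal(size=o)
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(o, h)), rng.normal(size=o)

def forward(W1):
    a1 = W1 @ x + b1                   # matmul_1, add_1
    h1 = np.maximum(0.0, a1)           # ReLU
    yhat = W2 @ h1 + b2                # matmul_2, add_2
    return a1, h1, 0.5 * np.sum((y - yhat) ** 2), yhat

a1, h1, L, yhat = forward(W1)

# Chain rule along the single path from L back to W1
dL_dyhat = yhat - y                    # derivative of 1/2 * ||y - yhat||^2
dL_dh1 = W2.T @ dL_dyhat               # back through matmul_2 (add_2 passes gradients through)
dL_da1 = dL_dh1 * (a1 > 0)             # back through ReLU
dL_dW1 = np.outer(dL_da1, x)           # back through matmul_1 (add_1 passes gradients through)

# Spot-check one entry against a finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(dL_dW1[0, 0], (forward(W1p)[2] - L) / eps)   # should agree to ~1e-5
```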
The computational graph distinguishes between: (1) Inputs — data that changes each batch but is not optimized, (2) Parameters — values we update via gradient descent (W₁, W₂, b₁, b₂), and (3) Intermediate values — computed from inputs and parameters, never directly modified. During training we take gradients with respect to the parameters, because those are the values we optimize.
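In PyTorch terms, for example, the three roles look roughly like this (a small illustrative snippet, not part of the running example above):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)

x = torch.randn(32, 784)   # input: changes every batch, requires_grad=False
w = model.weight           # parameter: nn.Parameter with requires_grad=True
logits = model(x)          # intermediate: carries a grad_fn linking it into the graph

print(x.requires_grad, w.requires_grad, logits.grad_fn is not None)
# False True True
```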
A fundamental architectural decision in computational graph frameworks is whether to use static or dynamic graph construction. This choice profoundly affects usability, debugging, performance, and flexibility.
In static graph frameworks, you first define the complete computational graph, then compile it, and finally execute it repeatedly with different data.
The workflow is:

1. Define the graph symbolically, with placeholders standing in for the data (no computation happens yet).
2. Compile and optimize the graph.
3. Execute the compiled graph repeatedly, feeding it different data each time.
Advantages:

- The framework sees the entire graph before execution, enabling aggressive whole-graph optimization (operation fusion, memory planning).
- The graph is easy to serialize, which simplifies deployment.
- Compilation cost is paid once; the optimized graph is then executed many times.

Disadvantages:

- Debugging is difficult: errors surface when the symbolic graph runs, far from the Python code that defined it.
- Control flow requires special constructs such as tf.cond and tf.while_loop rather than native Python.
- The fixed structure is less flexible for models whose shape changes from example to example.
```python
# Static Graph Approach (TensorFlow 1.x style)
import tensorflow as tf

# Phase 1: Define graph symbolically
x = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.float32, [None, 10])   # one-hot targets
weights = tf.Variable(tf.random.normal([784, 10]))
logits = tf.matmul(x, weights)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        labels=labels, logits=logits
    )
)
train_op = tf.train.GradientDescentOptimizer(
    learning_rate=0.01
).minimize(loss)

# Phase 2: Execute graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_x, batch_y in data_loader:
        # Graph executed here
        _, loss_val = sess.run(
            [train_op, loss],
            feed_dict={x: batch_x, labels: batch_y}
        )
```
In dynamic graph frameworks, the computational graph is constructed during execution. Each forward pass builds a new graph.

```python
# Dynamic Graph Approach (PyTorch/TF2 style)
import torch
import torch.nn as nn

# Define model (no graph built yet)
model = nn.Linear(784, 10)
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01
)
loss_fn = nn.CrossEntropyLoss()

# Training loop - graph built on each forward pass
for batch_x, batch_y in data_loader:
    # Graph built here, during execution
    logits = model(batch_x)
    loss = loss_fn(logits, batch_y)

    # Backward pass (traverses graph)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Graph discarded after each iteration
    # (unless explicitly retained)
```
Advantages:

- Native Python control flow: if/else, for, while — they just work.
- Debugging is natural; standard Python tools and stack traces apply.
- Highly flexible: the graph can change shape from one iteration to the next.

Disadvantages:

- Less scope for whole-graph optimization, since the graph is only seen as it is built.
- Graph-construction overhead is paid on every iteration.
- Deployment requires a Python runtime or a separate compilation step (e.g., TorchScript).
Modern frameworks increasingly blur this distinction. TensorFlow 2 adopted eager execution by default (dynamic) with tf.function for optimization. PyTorch added TorchScript for static graph compilation. The trend is toward dynamic for development, static for deployment.
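For instance, a function written in eager (dynamic) style can be handed to `tf.function`, which traces it into a reusable graph on its first call; the snippet below is a minimal sketch of that idea:

```python
import tensorflow as tf

@tf.function          # traces the Python function into a reusable static graph
def predict(x, w):
    return tf.nn.relu(tf.matmul(x, w))

x = tf.random.normal([8, 4])
w = tf.random.normal([4, 2])
y = predict(x, w)     # first call traces and compiles; later calls reuse the graph
```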
| Aspect | Static Graphs | Dynamic Graphs |
|---|---|---|
| Graph construction | Before execution (define phase) | During execution (on-the-fly) |
| Control flow | Special constructs (tf.cond, tf.while_loop) | Native Python (if/else, for/while) |
| Debugging | Challenging (symbolic errors) | Natural (standard Python debugging) |
| Optimization potential | High (full graph visible) | Lower (incremental construction) |
| Deployment | Easy serialization | Requires runtime or compilation |
| Flexibility | Less flexible | Highly flexible |
| Learning curve | Steeper | Gentler |
| Example frameworks | TF 1.x, Theano, Caffe | PyTorch, TF 2.x (eager), JAX |
To implement and reason about computational graphs precisely, we must understand the exact semantics of nodes and edges. This detail matters for both building frameworks and debugging gradient computations.
Each node in a computational graph carries: the operation it performs, references to its input (parent) nodes, the value computed during the forward pass, the gradient accumulated during the backward pass, shape and data-type metadata, and a gradient (VJP) function for its operation.
For gradient computation, the critical information is the gradient function (often called the VJP — Vector-Jacobian Product — function). This function answers: "Given the gradient of the loss with respect to this node's output, what are the gradients with respect to each input?"
```python
# Conceptual node structure for a computational graph
class Node:
    def __init__(self, operation, inputs, value=None):
        self.operation = operation   # e.g., "multiply", "relu", "matmul"
        self.inputs = inputs         # List of parent Node references
        self.value = value           # Computed during forward pass
        self.grad = None             # Accumulated gradient (set during backward)
        self.shape = None            # Tensor shape
        self.dtype = None            # Data type (float32, etc.)

    def compute_forward(self):
        """Compute this node's value from input values."""
        input_values = [inp.value for inp in self.inputs]
        self.value = FORWARD_FUNCTIONS[self.operation](*input_values)
        return self.value

    def compute_backward(self, output_grad):
        """
        Given gradient of loss w.r.t. this node's output,
        compute gradients w.r.t. each input.
        This is the VJP (Vector-Jacobian Product) operation.
        """
        input_values = [inp.value for inp in self.inputs]

        # Get gradients for each input
        input_grads = VJP_FUNCTIONS[self.operation](
            output_grad, self.value, *input_values
        )

        # Accumulate into input nodes
        for inp, grad in zip(self.inputs, input_grads):
            if inp.grad is None:
                inp.grad = grad
            else:
                inp.grad = inp.grad + grad  # Accumulate!


# Example VJP implementation for multiplication
def multiply_vjp(output_grad, output_value, x, y):
    """VJP for z = x * y"""
    # ∂L/∂x = ∂L/∂z * ∂z/∂x = output_grad * y
    # ∂L/∂y = ∂L/∂z * ∂z/∂y = output_grad * x
    grad_x = output_grad * y
    grad_y = output_grad * x
    return (grad_x, grad_y)
```

Edges represent data dependencies, but a subtle issue arises: fan-out. When one node's output is used by multiple downstream nodes, the gradients from all downstream paths must be summed.
Mathematically, if a value $v$ is used in computing $f_1(v)$ and $f_2(v)$, and ultimately both contribute to loss $L$:
$$\frac{\partial L}{\partial v} = \frac{\partial L}{\partial f_1} \cdot \frac{\partial f_1}{\partial v} + \frac{\partial L}{\partial f_2} \cdot \frac{\partial f_2}{\partial v}$$
This is the multivariate chain rule — gradients from multiple uses must be accumulated. The `inp.grad = inp.grad + grad` line in the code above implements exactly this accumulation.
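A tiny PyTorch example makes the accumulation visible: the value v fans out into two uses, and its gradient is the sum of both contributions.

```python
import torch

v = torch.tensor(3.0, requires_grad=True)
f1 = 2.0 * v   # first use of v
f2 = v * v     # second use of v
L = f1 + f2

L.backward()
# dL/dv = dL/df1 * 2 + dL/df2 * 2v = 2 + 6 = 8: the contributions are summed
print(v.grad)  # tensor(8.)
```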
For tensor operations, shapes must be carefully tracked. The gradient must have the same shape as the original value. This leads to implicit broadcasting in the forward pass requiring explicit reduction in the backward pass.
For example, if we add a bias $b \in \mathbb{R}^{n}$ to each row of $X \in \mathbb{R}^{m \times n}$, the forward pass broadcasts $b$ across all $m$ rows; in the backward pass, the gradient with respect to $b$ must therefore be summed over the row dimension so that it again has shape $n$.
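A short NumPy sketch of this reduction (with an all-ones upstream gradient standing in for whatever the rest of the graph would supply):

```python
import numpy as np

m, n = 4, 3
X = np.random.randn(m, n)
b = np.random.randn(n)

Y = X + b                  # forward: b is broadcast across the m rows
dL_dY = np.ones((m, n))    # upstream gradient, same shape as Y

dL_dX = dL_dY              # same shape as X, nothing to reduce
dL_db = dL_dY.sum(axis=0)  # backward: sum over the broadcast (row) dimension
print(dL_db.shape)         # (3,), matching b's shape as required
```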
When a value is used multiple times in a computation, gradients from each use must be SUMMED, not overwritten. This is a common source of bugs in manual gradient implementations. Frameworks handle this automatically, but understanding it is crucial for debugging and implementing custom operations.
Let's implement a minimal computational graph system to solidify understanding. This implementation demonstrates the core concepts without production complexity.
We'll build a system that can:

- Record operations and their inputs as a graph while ordinary Python code executes (via operator overloading)
- Compute forward values for addition, multiplication, ReLU, and summation
- Compute gradients for every participating tensor with a single backward() call
```python
import numpy as np


def _accumulate(tensor, grad):
    """Add grad into tensor.grad, handling the first contribution (grad is None)."""
    tensor.grad = grad if tensor.grad is None else tensor.grad + grad


class Tensor:
    """A value in the computational graph with gradient tracking."""

    def __init__(self, data, requires_grad=False, _children=(), _op=''):
        self.data = np.array(data, dtype=np.float32)
        self.requires_grad = requires_grad
        self.grad = None

        # Graph structure
        self._children = set(_children)   # Parent nodes
        self._op = _op                    # Operation that created this node
        self._backward = lambda: None     # Gradient function

    def __repr__(self):
        return f"Tensor(data={self.data}, grad={self.grad})"

    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data + other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='+'
        )

        def _backward():
            if self.requires_grad:
                _accumulate(self, out.grad)
            if other.requires_grad:
                _accumulate(other, out.grad)
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data * other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other),
            _op='*'
        )

        def _backward():
            if self.requires_grad:
                _accumulate(self, other.data * out.grad)
            if other.requires_grad:
                _accumulate(other, self.data * out.grad)
        out._backward = _backward
        return out

    def relu(self):
        out = Tensor(
            np.maximum(0, self.data),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='relu'
        )

        def _backward():
            if self.requires_grad:
                _accumulate(self, (out.data > 0) * out.grad)
        out._backward = _backward
        return out

    def sum(self):
        out = Tensor(
            self.data.sum(),
            requires_grad=self.requires_grad,
            _children=(self,),
            _op='sum'
        )

        def _backward():
            if self.requires_grad:
                _accumulate(self, np.ones_like(self.data) * out.grad)
        out._backward = _backward
        return out

    def backward(self):
        """Compute gradients via reverse-mode autodiff."""
        # Topological sort
        topo = []
        visited = set()

        def build_topo(tensor):
            if tensor not in visited:
                visited.add(tensor)
                for child in tensor._children:
                    build_topo(child)
                topo.append(tensor)

        build_topo(self)

        # Initialize gradient of output
        self.grad = np.ones_like(self.data)

        # Backpropagate in reverse topological order
        for tensor in reversed(topo):
            tensor._backward()
```
```python
# Using our mini autograd system
# Compute f(x, y) = sum(relu(x * y + 2))

x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Tensor([4.0, 5.0, 6.0], requires_grad=True)

# Forward pass (graph built implicitly)
z = x * y        # [4, 10, 18]
z = z + 2        # [6, 12, 20]
z = z.relu()     # [6, 12, 20] (all positive, ReLU is identity)
loss = z.sum()   # 38

print(f"Forward result: {loss.data}")  # 38.0

# Backward pass
loss.backward()

print(f"∂L/∂x = {x.grad}")  # [4, 5, 6] = y (since ReLU didn't zero anything)
print(f"∂L/∂y = {y.grad}")  # [1, 2, 3] = x

# Verify manually:
# L = sum(relu(x*y + 2))
# ∂L/∂x_i = ∂L/∂z_i * ∂z_i/∂(xy)_i * ∂(xy)_i/∂x_i
#         = 1 * 1 * y_i = y_i   (since relu(.) > 0)
```

Notice that we never explicitly built a graph. By overloading operators (`__add__`, `__mul__`) to create new Tensor objects with `_children` references, the graph is built automatically during execution. This is exactly how PyTorch and modern TensorFlow work — you write normal-looking code, and the graph emerges from your operations.
We've established the foundational abstraction of computational graphs—the representation that enables all of modern deep learning's automatic differentiation capabilities. Let's consolidate the key insights:

- Complex functions decompose into elementary operations arranged as a directed acyclic graph, evaluated in topological order.
- Each primitive carries a simple forward rule and a simple derivative (VJP) rule; the chain rule composes these into gradients of the whole function.
- When a value fans out to multiple uses, its gradient is the sum of the contributions from every path.
- Static graphs are defined before execution and can be optimized aggressively; dynamic graphs are built on the fly and are easier to write and debug.
- Operator overloading lets a framework build the graph implicitly as ordinary code runs, which is how PyTorch and eager-mode TensorFlow work.
With the computational graph representation established, we're ready to explore how computation actually flows through these graphs, beginning with the forward pass.
The computational graph is the foundation—now we'll see how it enables the magical automatic gradient computation that powers all of deep learning.
You now understand computational graphs as the fundamental abstraction behind automatic differentiation. This representation—decomposing complex functions into elementary operations organized as a directed acyclic graph—is what makes training neural networks with millions of parameters tractable. Next, we'll explore how the forward pass propagates values through this graph structure.