In modern deep learning architectures, particularly Transformer models, a critical component is the position-wise feed-forward sub-layer. This module enhances the representational capacity of the network by applying independent non-linear transformations to each position in a sequence.
The feed-forward sub-layer consists of multiple key elements working together:
The core transformation applies two successive linear projections with a non-linearity in between:
Mathematically, for an input vector x: $$\text{hidden} = \text{ReLU}(W_1 \cdot x + b_1)$$ $$\text{ffn\_output} = W_2 \cdot \text{hidden} + b_2$$
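The two equations above can be sketched directly in NumPy (the function name `feed_forward` is illustrative, not mandated by the problem):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear projections with a ReLU in between."""
    hidden = np.maximum(0.0, W1 @ x + b1)  # hidden = ReLU(W1 · x + b1)
    return W2 @ hidden + b2                # ffn_output = W2 · hidden + b2
```

Note that `W @ x` dots each row of W with x, matching the W · x convention used in the worked examples below.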
To prevent overfitting and improve generalization, dropout is applied after the feed-forward computation. During training, each neuron's output is randomly set to zero with probability p, and the remaining outputs are scaled by 1/(1-p) to maintain expected values (inverted dropout).
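Inverted dropout can be sketched as follows (the keep-if-`rand() >= p` comparison direction is a common convention and an assumption here):

```python
import numpy as np

def inverted_dropout(x, p, seed=None, training=True):
    """Zero each element with probability p, then scale survivors
    by 1/(1-p) so the expected value is unchanged (inverted dropout)."""
    if not training or p == 0.0:
        return x
    if seed is not None:
        np.random.seed(seed)               # reproducible mask
    keep = np.random.rand(*x.shape) >= p   # True where the element survives
    return x * keep / (1.0 - p)
```

At inference time (`training=False`) the input is returned unchanged, since the 1/(1-p) scaling already preserved expected values during training.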
The original input x is added directly to the transformed output, creating a residual pathway: $$\text{output} = x + \text{Dropout}(\text{ffn\_output})$$
This skip connection enables gradient flow during backpropagation and allows the network to learn identity mappings easily.
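The identity-mapping property is easy to see in a tiny sketch: when the FFN branch happens to output zeros (for instance, ReLU zeroed every pre-activation and the second-layer biases are zero), the residual add passes the input through unchanged:

```python
import numpy as np

x = np.array([1.0, -1.0])
ffn_output = np.zeros_like(x)  # hypothetical all-zero FFN branch result
output = x + ffn_output        # residual connection
# output equals x: the sub-layer reduces to the identity mapping
```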
Implement a function that performs the complete feed-forward residual block computation:
The function should apply dropout with probability dropout_p and scale surviving elements.

Note on Dropout Implementation:
• Use np.random.seed(seed) for reproducibility
• Generate random values with np.random.rand() and compare against the dropout probability to build the mask
• Scale surviving elements by 1/(1-dropout_p) during training

Example 1
Input:
x = [1.0, -1.0]
W1 = [[1.0, 2.0], [3.0, 4.0]]
b1 = [0.5, -0.5]
W2 = [[2.0, 1.0], [0.5, 1.0]]
b2 = [0.0, 0.5]
dropout_p = 0.0
seed = 42

Output:
[1.0, -0.5]

Explanation:
Step 1: First Linear Layer + ReLU • hidden_pre = W₁ · x + b₁ • For first neuron: (1.0 × 1.0) + (2.0 × -1.0) + 0.5 = 1.0 - 2.0 + 0.5 = -0.5 • For second neuron: (3.0 × 1.0) + (4.0 × -1.0) + (-0.5) = 3.0 - 4.0 - 0.5 = -1.5 • hidden_pre = [-0.5, -1.5] • hidden = ReLU([-0.5, -1.5]) = [0.0, 0.0]
Step 2: Second Linear Layer • ffn_output = W₂ · hidden + b₂ = [0.0, 0.0] + [0.0, 0.5] = [0.0, 0.5]
Step 3: Dropout • With dropout_p = 0.0, no elements are dropped • after_dropout = [0.0, 0.5]
Step 4: Residual Connection • output = x + after_dropout = [1.0, -1.0] + [0.0, 0.5] = [1.0, -0.5]
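Putting all four steps together, a minimal end-to-end sketch (the function name `ffn_residual_block` is an assumption, not given by the problem) reproduces Example 1:

```python
import numpy as np

def ffn_residual_block(x, W1, b1, W2, b2, dropout_p, seed, training=True):
    """FFN -> dropout -> residual add, as in the four steps above."""
    x = np.asarray(x, dtype=float)
    # Step 1: first linear layer + ReLU
    hidden = np.maximum(0.0, np.asarray(W1) @ x + np.asarray(b1))
    # Step 2: second linear layer
    ffn_output = np.asarray(W2) @ hidden + np.asarray(b2)
    # Step 3: inverted dropout (only during training, only if p > 0)
    if training and dropout_p > 0.0:
        np.random.seed(seed)
        keep = np.random.rand(*ffn_output.shape) >= dropout_p
        ffn_output = ffn_output * keep / (1.0 - dropout_p)
    # Step 4: residual connection
    return x + ffn_output

out = ffn_residual_block([1.0, -1.0],
                         [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5],
                         [[2.0, 1.0], [0.5, 1.0]], [0.0, 0.5],
                         dropout_p=0.0, seed=42)
# out -> [1.0, -0.5], matching the worked example
```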
Example 2
Input:
x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
dropout_p = 0.0
seed = 42

Output:
[2.0, 4.0]

Explanation:
Step 1: First Linear Layer + ReLU • With identity weight matrices (W₁ = I) and zero biases: • hidden_pre = I · [1.0, 2.0] + [0.0, 0.0] = [1.0, 2.0] • hidden = ReLU([1.0, 2.0]) = [1.0, 2.0] (all positive, unchanged)
Step 2: Second Linear Layer • ffn_output = I · [1.0, 2.0] + [0.0, 0.0] = [1.0, 2.0]
Step 3: Dropout • No dropout applied (p = 0.0)
Step 4: Residual Connection • output = [1.0, 2.0] + [1.0, 2.0] = [2.0, 4.0]
This example demonstrates how identity transformations with residual connections effectively double the input signal.
Example 3
Input:
x = [0.5, 0.5, 0.5]
W1 = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
b2 = [0.1, 0.2, 0.3]
dropout_p = 0.0
seed = 123

Output:
[2.85, 2.95, 3.05]

Explanation:
Step 1: First Linear Layer + ReLU • Each row of W₁ sums the input: 0.5 + 0.5 + 0.5 = 1.5 • hidden_pre = [1.5, 1.5, 1.5] • hidden = ReLU([1.5, 1.5, 1.5]) = [1.5, 1.5, 1.5] (all positive)
Step 2: Second Linear Layer • Each row computes: 0.5 × 1.5 + 0.5 × 1.5 + 0.5 × 1.5 = 2.25 • Plus biases: [2.25 + 0.1, 2.25 + 0.2, 2.25 + 0.3] = [2.35, 2.45, 2.55]
Step 3: Dropout • No dropout (p = 0.0)
Step 4: Residual Connection • output = [0.5, 0.5, 0.5] + [2.35, 2.45, 2.55] = [2.85, 2.95, 3.05]
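The arithmetic in Example 3 can be checked with a few lines of NumPy (uniform weight matrices built with `np.ones`/`np.full` for brevity):

```python
import numpy as np

x = np.array([0.5, 0.5, 0.5])
W1 = np.ones((3, 3));       b1 = np.zeros(3)
W2 = np.full((3, 3), 0.5);  b2 = np.array([0.1, 0.2, 0.3])

hidden = np.maximum(0.0, W1 @ x + b1)  # each row sums x: [1.5, 1.5, 1.5]
ffn = W2 @ hidden + b2                 # 0.5 * 4.5 = 2.25 per row, plus biases
out = x + ffn                          # residual add
# out -> [2.85, 2.95, 3.05]
```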
Constraints