In modern deep learning architectures, particularly Transformer models, a critical component is the position-wise feed-forward sub-layer. This module enhances the representational capacity of the network by applying independent non-linear transformations to each position in a sequence.
The feed-forward sub-layer consists of multiple key elements working together:
The core transformation applies two successive linear projections with a non-linearity in between:
Mathematically, for an input vector x: $$\text{hidden} = \text{ReLU}(W_1 \cdot x + b_1)$$ $$\text{ffn\_output} = W_2 \cdot \text{hidden} + b_2$$
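The two equations above can be sketched directly in NumPy (the function name `feed_forward` is illustrative, not mandated by the problem):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear projections with a ReLU in between."""
    hidden = np.maximum(0.0, W1 @ x + b1)  # hidden = ReLU(W1 · x + b1)
    return W2 @ hidden + b2                # ffn_output = W2 · hidden + b2
```

Note that `W @ x` dots each row of W with x, matching the W · x convention used in the worked examples below.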
To prevent overfitting and improve generalization, dropout is applied after the feed-forward computation. During training, each neuron's output is randomly set to zero with probability p, and the remaining outputs are scaled by 1/(1-p) to maintain expected values (inverted dropout).
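Inverted dropout can be sketched as follows (the keep-if-`rand() >= p` comparison direction is a common convention and an assumption here):

```python
import numpy as np

def inverted_dropout(x, p, seed=None, training=True):
    """Zero each element with probability p, then scale survivors
    by 1/(1-p) so the expected value is unchanged (inverted dropout)."""
    if not training or p == 0.0:
        return x
    if seed is not None:
        np.random.seed(seed)               # reproducible mask
    keep = np.random.rand(*x.shape) >= p   # True where the element survives
    return x * keep / (1.0 - p)
```

At inference time (`training=False`) the input is returned unchanged, since the 1/(1-p) scaling already preserved expected values during training.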
The original input x is added directly to the transformed output, creating a residual pathway: $$\text{output} = x + \text{Dropout}(\text{ffn\_output})$$
This skip connection enables gradient flow during backpropagation and allows the network to learn identity mappings easily.
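The identity-mapping property is easy to see in a tiny sketch: when the FFN branch happens to output zeros (for instance, ReLU zeroed every pre-activation and the second-layer biases are zero), the residual add passes the input through unchanged:

```python
import numpy as np

x = np.array([1.0, -1.0])
ffn_output = np.zeros_like(x)  # hypothetical all-zero FFN branch result
output = x + ffn_output        # residual connection
# output equals x: the sub-layer reduces to the identity mapping
```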
Implement a function that performs the complete feed-forward residual block computation:
The function should apply dropout with probability dropout_p and scale surviving elements.

Note on Dropout Implementation:
• Use np.random.seed(seed) for reproducibility
• Generate random values with np.random.rand() and compare against the dropout probability to build the mask
• Scale surviving elements by 1/(1-dropout_p) during training

Example 1
Input:
x = [1.0, -1.0]
W1 = [[1.0, 2.0], [3.0, 4.0]]
b1 = [0.5, -0.5]
W2 = [[2.0, 1.0], [0.5, 1.0]]
b2 = [0.0, 0.5]
dropout_p = 0.0
seed = 42

Output:
[1.0, -0.5]

Explanation:
Step 1: First Linear Layer + ReLU • hidden_pre = W₁ · x + b₁ • For first neuron: (1.0 × 1.0) + (2.0 × -1.0) + 0.5 = 1.0 - 2.0 + 0.5 = -0.5 • For second neuron: (3.0 × 1.0) + (4.0 × -1.0) + (-0.5) = 3.0 - 4.0 - 0.5 = -1.5 • hidden_pre = [-0.5, -1.5] • hidden = ReLU([-0.5, -1.5]) = [0.0, 0.0]
Step 2: Second Linear Layer • ffn_output = W₂ · hidden + b₂ = [0.0, 0.0] + [0.0, 0.5] = [0.0, 0.5]
Step 3: Dropout • With dropout_p = 0.0, no elements are dropped • after_dropout = [0.0, 0.5]
Step 4: Residual Connection • output = x + after_dropout = [1.0, -1.0] + [0.0, 0.5] = [1.0, -0.5]
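Putting all four steps together, a minimal end-to-end sketch (the function name `ffn_residual_block` is an assumption, not given by the problem) reproduces Example 1:

```python
import numpy as np

def ffn_residual_block(x, W1, b1, W2, b2, dropout_p, seed, training=True):
    """FFN -> dropout -> residual add, as in the four steps above."""
    x = np.asarray(x, dtype=float)
    # Step 1: first linear layer + ReLU
    hidden = np.maximum(0.0, np.asarray(W1) @ x + np.asarray(b1))
    # Step 2: second linear layer
    ffn_output = np.asarray(W2) @ hidden + np.asarray(b2)
    # Step 3: inverted dropout (only during training, only if p > 0)
    if training and dropout_p > 0.0:
        np.random.seed(seed)
        keep = np.random.rand(*ffn_output.shape) >= dropout_p
        ffn_output = ffn_output * keep / (1.0 - dropout_p)
    # Step 4: residual connection
    return x + ffn_output

out = ffn_residual_block([1.0, -1.0],
                         [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5],
                         [[2.0, 1.0], [0.5, 1.0]], [0.0, 0.5],
                         dropout_p=0.0, seed=42)
# out -> [1.0, -0.5], matching the worked example
```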
Example 2
Input:
x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
dropout_p = 0.0
seed = 42

Output:
[2.0, 4.0]

Explanation:
Step 1: First Linear Layer + ReLU • With identity weight matrices (W₁ = I) and zero biases: • hidden_pre = I · [1.0, 2.0] + [0.0, 0.0] = [1.0, 2.0] • hidden = ReLU([1.0, 2.0]) = [1.0, 2.0] (all positive, unchanged)
Step 2: Second Linear Layer • ffn_output = I · [1.0, 2.0] + [0.0, 0.0] = [1.0, 2.0]
Step 3: Dropout • No dropout applied (p = 0.0)
Step 4: Residual Connection • output = [1.0, 2.0] + [1.0, 2.0] = [2.0, 4.0]
This example demonstrates how identity transformations with residual connections effectively double the input signal.
Example 3
Input:
x = [0.5, 0.5, 0.5]
W1 = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
b2 = [0.1, 0.2, 0.3]
dropout_p = 0.0
seed = 123

Output:
[2.85, 2.95, 3.05]

Explanation:
Step 1: First Linear Layer + ReLU • Each row of W₁ sums the input: 0.5 + 0.5 + 0.5 = 1.5 • hidden_pre = [1.5, 1.5, 1.5] • hidden = ReLU([1.5, 1.5, 1.5]) = [1.5, 1.5, 1.5] (all positive)
Step 2: Second Linear Layer • Each row computes: 0.5 × 1.5 + 0.5 × 1.5 + 0.5 × 1.5 = 2.25 • Plus biases: [2.25 + 0.1, 2.25 + 0.2, 2.25 + 0.3] = [2.35, 2.45, 2.55]
Step 3: Dropout • No dropout (p = 0.0)
Step 4: Residual Connection • output = [0.5, 0.5, 0.5] + [2.35, 2.45, 2.55] = [2.85, 2.95, 3.05]
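The arithmetic in Example 3 can be checked with a few lines of NumPy (uniform weight matrices built with `np.ones`/`np.full` for brevity):

```python
import numpy as np

x = np.array([0.5, 0.5, 0.5])
W1 = np.ones((3, 3));       b1 = np.zeros(3)
W2 = np.full((3, 3), 0.5);  b2 = np.array([0.1, 0.2, 0.3])

hidden = np.maximum(0.0, W1 @ x + b1)  # each row sums x: [1.5, 1.5, 1.5]
ffn = W2 @ hidden + b2                 # 0.5 * 4.5 = 2.25 per row, plus biases
out = x + ffn                          # residual add
# out -> [2.85, 2.95, 3.05]
```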
Constraints