Stochastic Neuron Deactivation is a powerful regularization technique that combats overfitting in deep neural networks by randomly "silencing" a proportion of neurons during each training iteration. This technique, widely known as dropout, forces networks to learn more robust and distributed representations by preventing co-adaptation among neurons.
During the forward pass in training mode, each element is either zeroed out (with probability p) or kept and scaled:

$$\text{output}_i = \begin{cases} \frac{x_i}{1-p} & \text{if mask}_i = 1 \\ 0 & \text{if mask}_i = 0 \end{cases}$$

This scaling—called inverted dropout—is crucial because it ensures that the expected value of each activation remains unchanged between training and inference. Without scaling, the network would behave differently at test time, leading to degraded performance.
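As a sketch of this step (the function name and the mask convention `draw >= p` are illustrative assumptions, not part of the problem statement):

```python
import numpy as np

def forward_dropout(x, p, rng_seed=0):
    """Illustrative inverted-dropout forward pass."""
    np.random.seed(rng_seed)
    x = np.asarray(x, dtype=float)
    # mask[i] = 1 keeps the activation, 0 silences it; P(keep) = 1 - p
    mask = (np.random.rand(*x.shape) >= p).astype(float)
    # Survivors are scaled by 1/(1-p) so the expected activation matches x
    return mask * x / (1.0 - p), mask
```

The mask is returned alongside the output because the backward pass must reuse it.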
During the backward pass, the same binary mask must be applied to propagate gradients correctly:
$$\text{grad\_output}_i = \begin{cases} \frac{\text{grad}_i}{1-p} & \text{if mask}_i = 1 \\ 0 & \text{if mask}_i = 0 \end{cases}$$
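Assuming the mask from the forward pass was cached, the backward step is just the same masking and scaling applied to the incoming gradient (sketch; names are illustrative):

```python
import numpy as np

def backward_dropout(grad, mask, p):
    """Apply the cached forward-pass mask and the same 1/(1-p) scale to gradients."""
    grad = np.asarray(grad, dtype=float)
    # Dropped elements contribute zero gradient; kept ones are rescaled
    return mask * grad / (1.0 - p)
```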
During inference mode (training=False), no deactivation occurs—the input and gradients pass through unchanged. This is because we want deterministic, consistent predictions at test time.
Implement the stochastic_deactivation function that:
- samples a binary mask with deactivation probability p and applies it, with inverted-dropout scaling, to x to produce output
- applies the same mask and scaling to grad to produce grad_output
- passes x and grad through unchanged when training is False
- returns a dictionary with keys 'output' and 'grad_output'
Important: Use the provided seed with NumPy's random generator for reproducibility. Round results to 4 decimal places.
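A minimal end-to-end sketch, not a reference solution: it assumes the legacy np.random.seed / np.random.rand generator and treats a uniform draw >= p as "keep", a convention consistent with the worked examples below.

```python
import numpy as np

def stochastic_deactivation(x, grad, p, training, seed):
    x = np.asarray(x, dtype=float)
    grad = np.asarray(grad, dtype=float)
    if not training:
        # Inference mode: inputs and gradients pass through unchanged
        return {'output': x.tolist(), 'grad_output': grad.tolist()}
    np.random.seed(seed)
    # Keep an element when its uniform draw is >= p (assumed convention)
    mask = (np.random.rand(*x.shape) >= p).astype(float)
    scale = 1.0 / (1.0 - p)
    output = np.round(mask * x * scale, 4)
    grad_output = np.round(mask * grad * scale, 4)
    return {'output': output.tolist(), 'grad_output': grad_output.tolist()}
```

Note that the mask is drawn once and applied to both x and grad, mirroring how the backward pass must reuse the forward pass's mask.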
x = [1.0, 2.0, 3.0, 4.0]
grad = [0.1, 0.2, 0.3, 0.4]
p = 0.5
training = True
seed = 42

Output:

{'output': [0.0, 4.0, 6.0, 8.0], 'grad_output': [0.0, 0.4, 0.6, 0.8]}

Using seed 42 with probability p=0.5, the random mask happens to be [0, 1, 1, 1] (the first element is deactivated).
Forward pass:
- Element 0: masked out → 0.0
- Element 1: 2.0 × (1/0.5) = 2.0 × 2.0 = 4.0
- Element 2: 3.0 × (1/0.5) = 3.0 × 2.0 = 6.0
- Element 3: 4.0 × (1/0.5) = 4.0 × 2.0 = 8.0
Backward pass (same mask):
- Element 0: masked out → 0.0
- Element 1: 0.2 × 2.0 = 0.4
- Element 2: 0.3 × 2.0 = 0.6
- Element 3: 0.4 × 2.0 = 0.8
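The mask in this example can be checked directly. With the legacy NumPy generator (an assumption consistent with the stated result), the first four uniform draws under seed 42 are roughly [0.3745, 0.9507, 0.7320, 0.5987], so only the first falls below p = 0.5:

```python
import numpy as np

np.random.seed(42)
draws = np.random.rand(4)
# Element 0's draw (~0.3745) is below p = 0.5, so it is the only one dropped
mask = (draws >= 0.5).astype(int)
print(mask)  # prints [0 1 1 1]
```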
x = [[1.0, 2.0], [3.0, 4.0]]
grad = [[0.1, 0.2], [0.3, 0.4]]
p = 0.3
training = True
seed = 42

Output:

{'output': [[1.4286, 2.8571], [4.2857, 5.7143]], 'grad_output': [[0.1429, 0.2857], [0.4286, 0.5714]]}

With p=0.3 and seed=42, no elements happen to be deactivated in this case. The scaling factor is 1/(1-0.3) ≈ 1.4286.
Forward pass: each element is scaled by 1.4286
- 1.0 × 1.4286 = 1.4286
- 2.0 × 1.4286 = 2.8571
- 3.0 × 1.4286 = 4.2857
- 4.0 × 1.4286 = 5.7143
Backward pass: the same scaling applied to the gradients
- 0.1 × 1.4286 = 0.1429
- 0.2 × 1.4286 = 0.2857
- 0.3 × 1.4286 = 0.4286
- 0.4 × 1.4286 = 0.5714
x = [1.0, 2.0, 3.0, 4.0, 5.0]
grad = [0.1, 0.2, 0.3, 0.4, 0.5]
p = 0.5
training = False
seed = 42

Output:

{'output': [1.0, 2.0, 3.0, 4.0, 5.0], 'grad_output': [0.1, 0.2, 0.3, 0.4, 0.5]}

When training=False (inference mode), no deactivation or scaling is applied. The input and gradient tensors pass through completely unchanged, regardless of the dropout probability p. This ensures deterministic behavior during model evaluation and deployment.
Constraints