Stochastic Neuron Deactivation is a powerful regularization technique that combats overfitting in deep neural networks by randomly "silencing" a proportion of neurons during each training iteration. This technique, widely known as dropout, forces networks to learn more robust and distributed representations by preventing co-adaptation among neurons.
During the forward pass in training mode, each element is either zeroed out (with probability p) or kept and scaled:

$$\text{output}_i = \begin{cases} \frac{x_i}{1-p} & \text{if mask}_i = 1 \\ 0 & \text{if mask}_i = 0 \end{cases}$$

This scaling—called inverted dropout—is crucial because it ensures that the expected value of each activation remains unchanged between training and inference. Without scaling, the network would behave differently at test time, leading to degraded performance.
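As a sketch of this step (the function name and the mask convention `draw >= p` are illustrative assumptions, not part of the problem statement):

```python
import numpy as np

def forward_dropout(x, p, rng_seed=0):
    """Illustrative inverted-dropout forward pass."""
    np.random.seed(rng_seed)
    x = np.asarray(x, dtype=float)
    # mask[i] = 1 keeps the activation, 0 silences it; P(keep) = 1 - p
    mask = (np.random.rand(*x.shape) >= p).astype(float)
    # Survivors are scaled by 1/(1-p) so the expected activation matches x
    return mask * x / (1.0 - p), mask
```

The mask is returned alongside the output because the backward pass must reuse it.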
During the backward pass, the same binary mask must be applied to propagate gradients correctly:
$$\text{grad\_output}_i = \begin{cases} \frac{\text{grad}_i}{1-p} & \text{if mask}_i = 1 \\ 0 & \text{if mask}_i = 0 \end{cases}$$
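Assuming the mask from the forward pass was cached, the backward step is just the same masking and scaling applied to the incoming gradient (sketch; names are illustrative):

```python
import numpy as np

def backward_dropout(grad, mask, p):
    """Apply the cached forward-pass mask and the same 1/(1-p) scale to gradients."""
    grad = np.asarray(grad, dtype=float)
    # Dropped elements contribute zero gradient; kept ones are rescaled
    return mask * grad / (1.0 - p)
```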
During inference mode (training=False), no deactivation occurs—the input and gradients pass through unchanged. This is because we want deterministic, consistent predictions at test time.
Implement the stochastic_deactivation function that:
- samples a binary mask with deactivation probability p and applies it, with inverted-dropout scaling, to x to produce output
- applies the same mask and scaling to grad to produce grad_output
- passes x and grad through unchanged when training is False
- returns a dictionary with keys 'output' and 'grad_output'
Important: Use the provided seed with NumPy's random generator for reproducibility. Round results to 4 decimal places.
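A minimal end-to-end sketch, not a reference solution: it assumes the legacy np.random.seed / np.random.rand generator and treats a uniform draw >= p as "keep", a convention consistent with the worked examples below.

```python
import numpy as np

def stochastic_deactivation(x, grad, p, training, seed):
    x = np.asarray(x, dtype=float)
    grad = np.asarray(grad, dtype=float)
    if not training:
        # Inference mode: inputs and gradients pass through unchanged
        return {'output': x.tolist(), 'grad_output': grad.tolist()}
    np.random.seed(seed)
    # Keep an element when its uniform draw is >= p (assumed convention)
    mask = (np.random.rand(*x.shape) >= p).astype(float)
    scale = 1.0 / (1.0 - p)
    output = np.round(mask * x * scale, 4)
    grad_output = np.round(mask * grad * scale, 4)
    return {'output': output.tolist(), 'grad_output': grad_output.tolist()}
```

Note that the mask is drawn once and applied to both x and grad, mirroring how the backward pass must reuse the forward pass's mask.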
x = [1.0, 2.0, 3.0, 4.0]
grad = [0.1, 0.2, 0.3, 0.4]
p = 0.5
training = True
seed = 42

Output:

{'output': [0.0, 4.0, 6.0, 8.0], 'grad_output': [0.0, 0.4, 0.6, 0.8]}

Using seed 42 with probability p=0.5, the random mask happens to be [0, 1, 1, 1] (the first element is deactivated).
Forward pass:
- Element 0: masked out → 0.0
- Element 1: 2.0 × (1/0.5) = 2.0 × 2.0 = 4.0
- Element 2: 3.0 × (1/0.5) = 3.0 × 2.0 = 6.0
- Element 3: 4.0 × (1/0.5) = 4.0 × 2.0 = 8.0
Backward pass (same mask):
- Element 0: masked out → 0.0
- Element 1: 0.2 × 2.0 = 0.4
- Element 2: 0.3 × 2.0 = 0.6
- Element 3: 0.4 × 2.0 = 0.8
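The mask in this example can be checked directly. With the legacy NumPy generator (an assumption consistent with the stated result), the first four uniform draws under seed 42 are roughly [0.3745, 0.9507, 0.7320, 0.5987], so only the first falls below p = 0.5:

```python
import numpy as np

np.random.seed(42)
draws = np.random.rand(4)
# Element 0's draw (~0.3745) is below p = 0.5, so it is the only one dropped
mask = (draws >= 0.5).astype(int)
print(mask)  # prints [0 1 1 1]
```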
x = [[1.0, 2.0], [3.0, 4.0]]
grad = [[0.1, 0.2], [0.3, 0.4]]
p = 0.3
training = True
seed = 42

Output:

{'output': [[1.4286, 2.8571], [4.2857, 5.7143]], 'grad_output': [[0.1429, 0.2857], [0.4286, 0.5714]]}

With p=0.3 and seed=42, no elements happen to be deactivated in this case. The scaling factor is 1/(1-0.3) ≈ 1.4286.
Forward pass: each element is scaled by 1.4286
- 1.0 × 1.4286 = 1.4286
- 2.0 × 1.4286 = 2.8571
- 3.0 × 1.4286 = 4.2857
- 4.0 × 1.4286 = 5.7143
Backward pass: the same scaling applied to the gradients
- 0.1 × 1.4286 = 0.1429
- 0.2 × 1.4286 = 0.2857
- 0.3 × 1.4286 = 0.4286
- 0.4 × 1.4286 = 0.5714
x = [1.0, 2.0, 3.0, 4.0, 5.0]
grad = [0.1, 0.2, 0.3, 0.4, 0.5]
p = 0.5
training = False
seed = 42

Output:

{'output': [1.0, 2.0, 3.0, 4.0, 5.0], 'grad_output': [0.1, 0.2, 0.3, 0.4, 0.5]}

When training=False (inference mode), no deactivation or scaling is applied. The input and gradient tensors pass through completely unchanged, regardless of the dropout probability p. This ensures deterministic behavior during model evaluation and deployment.
Constraints