In the realm of neural network optimization, gradient computation forms the backbone of the learning process. During backpropagation, gradients (derivatives) of activation functions are essential for calculating how network weights should be adjusted to minimize the loss function.
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. However, to train these networks effectively, we must understand how small changes in the input to an activation function affect its output—this is precisely what the derivative tells us.
The Three Core Activation Functions:
1. Sigmoid Activation (σ): $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The sigmoid function squashes input values to the range (0, 1), making it historically popular for output layers in binary classification. Its derivative has an elegant closed form: $$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$$
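The closed form can be sanity-checked against a central finite difference; a minimal sketch (helper names here are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a central finite difference at a few points
h = 1e-5
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(sigmoid_prime(x) - numeric) < 1e-8
```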
2. Hyperbolic Tangent (Tanh): $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
Tanh maps inputs to the range (-1, 1), offering zero-centered outputs which can aid gradient flow. Its derivative is: $$\tanh'(x) = 1 - \tanh^2(x)$$
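The same finite-difference check works for tanh; a short sketch using the standard library's `math.tanh`:

```python
import math

def tanh_prime(x):
    # Closed form: tanh'(x) = 1 - tanh^2(x)
    t = math.tanh(x)
    return 1.0 - t * t

# Verify against a central finite difference
h = 1e-5
for x in (-1.0, 0.0, 2.0):
    numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
    assert abs(tanh_prime(x) - numeric) < 1e-8
```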
3. Rectified Linear Unit (ReLU): $$\text{ReLU}(x) = \max(0, x)$$
ReLU has become the default activation for hidden layers in modern deep learning due to its computational efficiency and ability to mitigate the vanishing gradient problem. Its derivative is simply: $$\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$
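The piecewise derivative is straightforward to code; note that ReLU is not differentiable at exactly x = 0, and the convention of using 0 there (as in this exercise) is a deliberate choice. A minimal sketch:

```python
def relu(x):
    return max(0.0, x)

def relu_prime(x):
    # Piecewise derivative: 1 for x > 0, 0 for x <= 0.
    # The value at x = 0 is a convention (subgradient), not a true derivative.
    return 1.0 if x > 0 else 0.0

# Away from x = 0 the finite difference agrees with the closed form
h = 1e-5
for x in (-1.0, 1.0):
    numeric = (relu(x + h) - relu(x - h)) / (2 * h)
    assert abs(relu_prime(x) - numeric) < 1e-8
```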
Your Task:
Implement a Python function that computes the derivatives of all three activation functions at a given input value x. Return a dictionary with keys 'sigmoid', 'tanh', and 'relu' containing the respective derivative values. Round the sigmoid derivative to 4 decimal places and the tanh derivative to 2 decimal places.
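A minimal sketch of the requested function (the name `activation_derivatives` is an assumption; the exercise does not fix one):

```python
import math

def activation_derivatives(x):
    """Return the derivatives of sigmoid, tanh, and ReLU at x."""
    s = 1.0 / (1.0 + math.exp(-x))            # sigma(x)
    sigmoid_grad = round(s * (1.0 - s), 4)    # sigma'(x) = s(1 - s), 4 decimal places
    t = math.tanh(x)
    tanh_grad = round(1.0 - t * t, 2)         # tanh'(x) = 1 - tanh^2(x), 2 decimal places
    relu_grad = 1.0 if x > 0 else 0.0         # ReLU'(x); 0 at x <= 0 by convention
    return {'sigmoid': sigmoid_grad, 'tanh': tanh_grad, 'relu': relu_grad}

print(activation_derivatives(0.0))  # {'sigmoid': 0.25, 'tanh': 1.0, 'relu': 0.0}
```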
Example 1: x = 0.0
Expected output: {'sigmoid': 0.25, 'tanh': 1.0, 'relu': 0.0}
At x = 0:
Sigmoid derivative: • σ(0) = 1 / (1 + e⁰) = 1 / 2 = 0.5 • σ'(0) = 0.5 × (1 - 0.5) = 0.5 × 0.5 = 0.25
Tanh derivative: • tanh(0) = 0 • tanh'(0) = 1 - 0² = 1.0
ReLU derivative: • Since x = 0 is not greater than 0, ReLU'(0) = 0.0
Note: At x = 0, the tanh derivative reaches its maximum value of 1.0, while the sigmoid derivative peaks at only 0.25. This illustrates why tanh historically offered better gradient flow than sigmoid.
Example 2: x = 1.0
Expected output: {'sigmoid': 0.1966, 'tanh': 0.42, 'relu': 1.0}
At x = 1.0:
Sigmoid derivative: • σ(1) = 1 / (1 + e⁻¹) ≈ 0.7311 • σ'(1) = 0.7311 × (1 - 0.7311) ≈ 0.7311 × 0.2689 ≈ 0.1966
Tanh derivative: • tanh(1) ≈ 0.7616 • tanh'(1) = 1 - 0.7616² ≈ 1 - 0.58 ≈ 0.42
ReLU derivative: • Since x = 1.0 > 0, ReLU'(1) = 1.0
Observe how both sigmoid and tanh derivatives are already decreasing from their peaks, while ReLU maintains a constant gradient of 1.0 in the positive region.
Example 3: x = -1.0
Expected output: {'sigmoid': 0.1966, 'tanh': 0.42, 'relu': 0.0}
At x = -1.0:
Sigmoid derivative: • σ(-1) = 1 / (1 + e¹) ≈ 0.2689 • σ'(-1) = 0.2689 × (1 - 0.2689) ≈ 0.2689 × 0.7311 ≈ 0.1966
Note: Due to symmetry, σ'(-x) = σ'(x), so the derivative at x = -1 equals the derivative at x = 1.
Tanh derivative: • tanh(-1) ≈ -0.7616 • tanh'(-1) = 1 - (-0.7616)² = 1 - 0.58 ≈ 0.42
Similarly, tanh' is symmetric: tanh'(-x) = tanh'(x).
ReLU derivative: • Since x = -1.0 ≤ 0, ReLU'(-1) = 0.0
This relates to the "dying ReLU" phenomenon: a neuron whose pre-activation is negative for all inputs receives zero gradient and stops learning.
Constraints