In deep learning, activation functions play a critical role in enabling neural networks to learn complex, non-linear patterns. While ReLU has dominated the field for years, researchers have continuously sought activation functions with superior properties for gradient flow and generalization.
The Mish Activation Function is a modern, self-regularizing activation function that has demonstrated remarkable performance improvements over traditional activations like ReLU and Swish in various deep learning architectures. Introduced as part of the quest for better gradient dynamics, Mish combines self-gating mechanisms with smooth, unbounded characteristics that allow it to preserve small negative gradients rather than eliminating them entirely.
Mathematical Definition:
The Mish function is defined as:
$$\text{Mish}(x) = x \cdot \tanh(\text{softplus}(x))$$
Where the softplus function is a smooth approximation of the ReLU function:
$$\text{softplus}(x) = \ln(1 + e^x)$$
The complete expanded formula becomes:
$$\text{Mish}(x) = x \cdot \tanh(\ln(1 + e^x))$$
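A direct translation of this formula can overflow for large positive x, since e^x grows quickly. A minimal Python sketch using a numerically safer but equivalent form of softplus (the identity ln(1 + e^x) = max(x, 0) + ln(1 + e^-|x|) is standard; the function names here are illustrative):

```python
import math

def softplus(x: float) -> float:
    # Equivalent to ln(1 + e^x), but avoids overflow for large positive x:
    # ln(1 + e^x) = max(x, 0) + ln(1 + e^(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    # Mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))
```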
Key Properties of Mish:
- Smooth and non-monotonic: differentiable everywhere, with a small dip for negative inputs.
- Unbounded above, bounded below: positive outputs grow without limit, while negative outputs bottom out at roughly -0.31.
- Self-gating: the input x is scaled by tanh(softplus(x)), a gate computed from the input itself.
- Preserves small negative values: unlike ReLU, negative inputs are not zeroed out, which helps gradients flow.
Your Task:
Implement a function that computes the Mish activation value for a given scalar input. The function should return the result rounded to 4 decimal places.
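One way to implement the task (a sketch, assuming the function is named `mish` and takes a single Python float):

```python
import math

def mish(x: float) -> float:
    """Return the Mish activation of x, rounded to 4 decimal places."""
    softplus = math.log1p(math.exp(x))   # ln(1 + e^x)
    return round(x * math.tanh(softplus), 4)
```

`math.log1p(math.exp(x))` computes ln(1 + e^x) with better precision than `math.log(1 + math.exp(x))` when e^x is small.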
Computational Steps:
For x = 1.0:
Step 1: Calculate softplus(1.0) = ln(1 + e¹) = ln(1 + 2.7183) = ln(3.7183) ≈ 1.3133
Step 2: Apply tanh: tanh(1.3133) ≈ 0.8651
Step 3: Multiply by input: 1.0 × 0.8651 = 0.8651
The Mish activation value for x = 1.0 is 0.8651. Notice how the output is close to but slightly less than the input, demonstrating Mish's smooth gating behavior for positive values.
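The three steps above can be reproduced directly with Python's math module (a sketch; variable names are my own):

```python
import math

x = 1.0
sp = math.log(1 + math.exp(x))   # Step 1: softplus(1.0) = ln(1 + e) ~ 1.3133
t = math.tanh(sp)                # Step 2: tanh(1.3133) ~ 0.8651
y = x * t                        # Step 3: 1.0 * 0.8651 = 0.8651
print(round(sp, 4), round(t, 4), round(y, 4))
```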
For x = 0.0:
Step 1: Calculate softplus(0.0) = ln(1 + e⁰) = ln(2) ≈ 0.6931
Step 2: Apply tanh: tanh(0.6931) ≈ 0.6
Step 3: Multiply by input: 0.0 × 0.6 = 0.0
The Mish activation value for x = 0.0 is 0.0. Since the final formula multiplies by x, any input of zero produces an output of zero, regardless of the intermediate softplus and tanh computations.
For x = -1.0:
Step 1: Calculate softplus(-1.0) = ln(1 + e⁻¹) = ln(1 + 0.3679) = ln(1.3679) ≈ 0.3133
Step 2: Apply tanh: tanh(0.3133) ≈ 0.3034
Step 3: Multiply by input: (-1.0) × 0.3034 = -0.3034
The Mish activation value for x = -1.0 is -0.3034. This demonstrates one of Mish's key advantages: unlike ReLU which would output 0, Mish preserves a small negative gradient, allowing information to flow even for negative inputs.
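This contrast with ReLU can be checked numerically (a sketch; `relu` here is simply max(0, x)):

```python
import math

def mish(x: float) -> float:
    return x * math.tanh(math.log1p(math.exp(x)))

def relu(x: float) -> float:
    return max(0.0, x)

x = -1.0
print(round(mish(x), 4))   # Mish keeps a small negative value
print(relu(x))             # ReLU zeroes the input entirely
```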
Constraints