Proper weight initialization is a cornerstone of training deep neural networks successfully. When weights are initialized poorly—either too small or too large—networks suffer from vanishing or exploding gradients, making learning extremely slow or impossible. The Kaiming initialization strategy (also known as He initialization, named after Kaiming He) was specifically engineered to address these challenges for networks using ReLU (Rectified Linear Unit) and its variants as activation functions.
Traditional random initialization methods often fail when applied to deep networks with ReLU activations. The ReLU function zeroes out negative values, effectively halving the variance of activations as signals propagate through layers. Without compensation, this variance shrinkage compounds exponentially with depth, causing activations and gradients in later layers to vanish and learning to stall.
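The halving effect is easy to check empirically. A minimal sketch (assuming standard normal pre-activations; sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)   # pre-activations with E[z^2] = 1
a = np.maximum(z, 0.0)               # ReLU zeroes out the negative half

# The mean square of the activations is roughly half that of the
# inputs -- the per-layer shrinkage Kaiming initialization compensates for.
print(np.mean(z**2))  # close to 1.0
print(np.mean(a**2))  # close to 0.5
```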
Kaiming initialization counteracts this variance decay by scaling the initial weights based on the fan dimension of each layer. The key insight is to keep activation variance consistent across layers by scaling weights proportionally to $\sqrt{\frac{2}{n}}$, where $n$ is the fan dimension (fan-in or fan-out).
Normal Distribution: Weights are sampled from a Gaussian distribution with zero mean and standard deviation calculated as:
$$\sigma = \sqrt{\frac{2}{\text{fan}}}$$
Uniform Distribution: Weights are sampled uniformly from the interval $[-\text{bound}, +\text{bound}]$, where:
$$\text{bound} = \sqrt{\frac{6}{\text{fan}}}$$
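Both scale factors follow directly from the chosen fan value. A small helper sketch (the name kaiming_scales is illustrative, not part of the task):

```python
import numpy as np

def kaiming_scales(fan):
    """Return (sigma, bound) for the normal and uniform Kaiming variants."""
    sigma = np.sqrt(2.0 / fan)   # std dev for the normal variant
    bound = np.sqrt(6.0 / fan)   # half-width for the uniform variant
    return sigma, bound

sigma, bound = kaiming_scales(3)
print(round(sigma, 4))  # 0.8165
print(round(bound, 4))  # 1.4142
```

Note that bound = σ·√3, so a uniform sample over [-bound, +bound] has the same variance 2/fan as the normal variant.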
Implement a function kaiming_weight_setup(n_in, n_out, mode, distribution, seed) that generates a weight matrix initialized according to the Kaiming strategy:

• mode: 'fan_in' or 'fan_out' to determine the scaling dimension
• distribution: 'normal' for Gaussian sampling or 'uniform' for uniform sampling

The function should return a NumPy array of shape (n_in, n_out) containing the initialized weights, with all values rounded to 4 decimal places.
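One possible sketch of the required function, assuming the legacy np.random.seed API and NumPy's default row-major sampling order (which the worked examples appear to use):

```python
import numpy as np

def kaiming_weight_setup(n_in, n_out, mode, distribution, seed):
    """Kaiming-initialized (n_in, n_out) weight matrix, rounded to 4 decimals."""
    fan = n_in if mode == 'fan_in' else n_out
    np.random.seed(seed)  # legacy global seeding, matching the examples
    if distribution == 'normal':
        sigma = np.sqrt(2.0 / fan)                      # std = sqrt(2/fan)
        w = np.random.randn(n_in, n_out) * sigma
    else:  # 'uniform'
        bound = np.sqrt(6.0 / fan)                      # half-width = sqrt(6/fan)
        w = np.random.rand(n_in, n_out) * 2 * bound - bound
    return np.round(w, 4)
```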
Example 1:
Input: n_in = 3, n_out = 2, mode = 'fan_in', distribution = 'normal', seed = 42
Output: [[0.4056, -0.1129], [0.5288, 1.2435], [-0.1912, -0.1912]]

With 'fan_in' mode and n_in = 3, the fan value is 3. For the normal distribution, the standard deviation is σ = √(2/3) ≈ 0.8165. Using seed 42, NumPy's random number generator produces standard normal samples, which are then scaled by σ.
• The first sample from randn (≈0.4967) becomes 0.4967 × 0.8165 ≈ 0.4056
• The second sample (≈-0.1383) becomes -0.1383 × 0.8165 ≈ -0.1129
This scaling ensures that the variance of activations remains stable as signals propagate through ReLU layers.
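That stability can be spot-checked by pushing a random batch through several ReLU layers initialized with Kaiming-normal fan-in scaling (an illustrative experiment; the batch size, width, and depth are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((1024, 256))  # batch of inputs, variance ~ 1

for _ in range(10):
    n_in = h.shape[1]
    w = rng.standard_normal((n_in, n_in)) * np.sqrt(2.0 / n_in)  # Kaiming normal, fan_in
    h = np.maximum(h @ w, 0.0)                                   # linear layer + ReLU

# After 10 layers the mean square of the activations stays on the order
# of 1 instead of decaying by ~2x per layer (which would give ~0.001).
print(np.mean(h**2))
```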
Example 2:
Input: n_in = 3, n_out = 2, mode = 'fan_in', distribution = 'uniform', seed = 42
Output: [[-0.3549, 1.2748], [0.6562, 0.279], [-0.9729, -0.973]]

With 'fan_in' mode and n_in = 3, the bound is √(6/3) = √2 ≈ 1.4142. Uniform random samples from [0, 1) are transformed to the range [-1.4142, 1.4142]. Using seed 42, the first uniform sample (≈0.3745) is mapped to 2 × 0.3745 × 1.4142 - 1.4142 ≈ -0.3549. The uniform distribution can provide better conditioning for certain network architectures and optimization scenarios.
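The mapping from a raw [0, 1) sample u to a Kaiming-uniform weight is just 2·u·bound − bound. Reproducing the first entry (legacy seeding assumed):

```python
import numpy as np

bound = np.sqrt(6.0 / 3)      # fan = 3, bound = sqrt(2) ≈ 1.4142
np.random.seed(42)
u = np.random.rand(3, 2)      # raw uniforms in [0, 1); first is ≈ 0.3745
w = 2 * u * bound - bound     # shift and scale into [-bound, +bound]
print(round(float(w[0, 0]), 4))  # -0.3549, matching the example output
```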
Example 3:
Input: n_in = 3, n_out = 4, mode = 'fan_out', distribution = 'normal', seed = 42
Output: [[0.3512, -0.0978, 0.458, 1.0769], [-0.1656, -0.1656, 1.1167, 0.5427], [-0.332, 0.3836, -0.3277, -0.3293]]

Switching to 'fan_out' mode with n_out = 4, the fan value becomes 4, and the standard deviation is σ = √(2/4) = √0.5 ≈ 0.7071. This mode is preferred when gradient flow during backpropagation is the primary concern. Each weight is computed by scaling a standard normal sample by this σ. Notice the smaller scaling factor compared to fan_in mode, reflecting the larger fan dimension.
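With seed 42 the underlying standard normal draws are the same as in the first example; only the scale changes. Reproducing the first two entries (legacy seeding assumed):

```python
import numpy as np

sigma = np.sqrt(2.0 / 4)      # fan_out = 4, sigma = sqrt(0.5) ≈ 0.7071
np.random.seed(42)
z = np.random.randn(3, 4)     # first standard normal sample ≈ 0.4967
w = np.round(z * sigma, 4)
print(w[0, 0], w[0, 1])       # 0.3512 -0.0978, matching the example output
```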
Constraints