The Mixture-of-Experts (MoE) architecture is a powerful paradigm in deep learning that enables efficient scaling of neural networks by conditionally activating only a subset of model parameters for each input. Instead of processing every input through the entire network, MoE employs a gating mechanism to dynamically route inputs to the most relevant "expert" sub-networks, dramatically improving computational efficiency while maintaining or even increasing model capacity.
In a Sparse MoE layer, we have:
- E expert networks, each with its own weight matrix of shape (d_input, d_output)
- A gating network with weight matrix W_g of shape (d_input, num_experts) that scores how relevant each expert is to a given token
- A top-k selection step, so that only the k highest-scoring experts process each token
This sparsity is the key innovation—while the model has massive total capacity (all experts combined), the computation cost per input scales with k, not E.
Given an input tensor x of shape (batch_size, sequence_length, d_input), the Sparse MoE forward pass proceeds as follows:
For each token, compute the raw gating scores by multiplying with the gating weight matrix: $$\text{logits} = x \cdot W_g$$ where W_g has shape (d_input, num_experts), producing logits of shape (batch_size, seq_len, num_experts).
Convert logits to probabilities using the softmax function: $$g_i = \frac{e^{\text{logits}_i}}{\sum_{j=1}^{E} e^{\text{logits}_j}}$$
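As a quick numerical check of the softmax step (using NumPy here purely as an illustration; the values match Example 3 below, where the logits are [2.0, 0.0]):

```python
import numpy as np

logits = np.array([2.0, 0.0])
# Subtract the max before exponentiating for numerical stability;
# this shifts all logits equally and does not change the result.
z = np.exp(logits - logits.max())
gates = z / z.sum()
# gates ≈ [0.8808, 0.1192] — a strong preference for expert 0
```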
Select the indices of the top-k experts with the highest gating probabilities for each token. This ensures sparse activation—only k experts are utilized per token.
Renormalize the gating probabilities for only the selected experts so they sum to 1: $$\hat{g}_i = \frac{g_i}{\sum_{j \in \text{top-k}} g_j}$$
For each selected expert, compute its output by applying its weight matrix to the input token: $$\text{expert\_output}_i = x \cdot W_{e_i}$$ where W_e has shape (num_experts, d_input, d_output).
Combine the expert outputs using the renormalized gating weights: $$y = \sum_{i \in \text{top-k}} \hat{g}_i \cdot \text{expert_output}_i$$
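The steps above can be sketched end-to-end in NumPy. This is one possible reference implementation, not the prescribed one; in particular, the tie-breaking rule (stable sort, so the lower expert index wins on equal scores) is an assumption:

```python
import numpy as np

def sparse_moe_forward(x, We, Wg, top_k):
    """Sparse MoE forward pass (illustrative sketch)."""
    x = np.asarray(x, dtype=float)    # (batch, seq, d_input)
    We = np.asarray(We, dtype=float)  # (num_experts, d_input, d_output)
    Wg = np.asarray(Wg, dtype=float)  # (d_input, num_experts)

    # Step 1-2: gating logits, then a numerically stable softmax over experts
    logits = x @ Wg                                          # (batch, seq, E)
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    gates = z / z.sum(axis=-1, keepdims=True)

    # Step 3: top-k expert indices per token
    # (stable sort keeps the lower expert index on ties)
    topk_idx = np.argsort(-gates, axis=-1, kind="stable")[..., :top_k]

    # Step 4: renormalize the selected gating probabilities
    topk_g = np.take_along_axis(gates, topk_idx, axis=-1)
    topk_g = topk_g / topk_g.sum(axis=-1, keepdims=True)

    # Steps 5-6: apply each selected expert and combine with gating weights
    batch, seq, _ = x.shape
    y = np.zeros((batch, seq, We.shape[-1]))
    for b in range(batch):
        for t in range(seq):
            for w, e in zip(topk_g[b, t], topk_idx[b, t]):
                y[b, t] += w * (x[b, t] @ We[e])
    return y
```

The per-token loop keeps the routing explicit; a production implementation would instead batch tokens by expert, but the arithmetic is the same.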
Implement the sparse_moe_forward function that computes the forward pass of a Sparse Mixture-of-Experts layer. Your function should:
- Compute the gating logits and softmax probabilities for every token
- Select the top_k experts with the highest gating probability per token
- Renormalize the selected probabilities so they sum to 1
- Apply each selected expert's weight matrix to the token and combine the expert outputs using the renormalized weights
- Return the output tensor of shape (batch_size, seq_len, d_output)
Example 1:
x = [[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], [[6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]]
We = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
Wg = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
top_k = 1

Output: [[[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]], [[13.0, 13.0], [17.0, 17.0], [21.0, 21.0]]]

Input Analysis: All four experts share identical weights and Wg produces identical logits, so the softmax is uniform (0.25 per expert); whichever expert top-1 selects, the output is the same.
Step-by-step Computation (for first token [0.0, 1.0]):
logits = [0.0, 1.0] · Wg = [1.0, 1.0, 1.0, 1.0]; softmax → [0.25, 0.25, 0.25, 0.25]; top-1 selects a single expert with renormalized weight 1.0; output = [0.0, 1.0] · We[0] = [0+1, 0+1] = [1.0, 1.0]
For token [2.0, 3.0]: output = [2+3, 2+3] = [5.0, 5.0]
For token [4.0, 5.0]: output = [4+5, 4+5] = [9.0, 9.0]
And similarly for the second batch, producing the final output tensor.
Example 2:
x = [[[1.0, 2.0], [3.0, 4.0]]]
We = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
Wg = [[0.5, 0.5], [0.5, 0.5]]
top_k = 2

Output: [[[3.0, 3.0], [7.0, 7.0]]]

Input Analysis: Both experts have identical weights and Wg gives identical logits, so each expert receives gating weight 0.5; with top_k = 2 both are selected, so no probability mass is discarded during renormalization.
Step-by-step Computation (for token [1.0, 2.0]):
logits = [1.0, 2.0] · Wg = [1.5, 1.5]; softmax → [0.5, 0.5]; both experts are in the top-2, so the renormalized weights remain 0.5 each; each expert output = [1+2, 1+2] = [3.0, 3.0]; combined: 0.5 × [3.0, 3.0] + 0.5 × [3.0, 3.0] = [3.0, 3.0]
For token [3.0, 4.0]: 0.5 × [7.0, 7.0] + 0.5 × [7.0, 7.0] = [7.0, 7.0]
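The same arithmetic for this example, written out in NumPy (purely illustrative, computing one token directly rather than via the full function):

```python
import numpy as np

x_tok = np.array([1.0, 2.0])
Wg = np.array([[0.5, 0.5], [0.5, 0.5]])
We = np.array([[[1.0, 1.0], [1.0, 1.0]],
               [[1.0, 1.0], [1.0, 1.0]]])

logits = x_tok @ Wg   # [1.5, 1.5] -> softmax gives 0.5 per expert
# With top_k = 2 both experts are kept, so the weights stay 0.5 each.
out = 0.5 * (x_tok @ We[0]) + 0.5 * (x_tok @ We[1])   # [3.0, 3.0]
```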
Example 3:
x = [[[1.0, 1.0]]]
We = [[[2.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]]
Wg = [[1.0, 0.0], [1.0, 0.0]]
top_k = 1

Output: [[[2.0, 2.0]]]

Input Analysis: Here the experts differ: Expert 0 scales its input by 2 (We[0] = 2I) while Expert 1 is the identity, and Wg = [[1.0, 0.0], [1.0, 0.0]] pushes the gating logits toward Expert 0.
Step-by-step Computation:
logits = [1.0, 1.0] · Wg = [2.0, 0.0]; softmax → [0.881, 0.119]; top-1 selects Expert 0 with renormalized weight 1.0; output = [1.0, 1.0] · We[0] = [2.0, 2.0]
The result [2.0, 2.0] shows that Expert 0's 2× scaling transformation was applied.
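This example can be verified directly with a few lines of NumPy (illustrative only; top-1 selection is done here with argmax):

```python
import numpy as np

x_tok = np.array([1.0, 1.0])
Wg = np.array([[1.0, 0.0], [1.0, 0.0]])
We = np.array([[[2.0, 0.0], [0.0, 2.0]],   # Expert 0: scale by 2
               [[1.0, 0.0], [0.0, 1.0]]])  # Expert 1: identity

logits = x_tok @ Wg                          # [2.0, 0.0]
gates = np.exp(logits) / np.exp(logits).sum()   # ≈ [0.881, 0.119]
# Top-1 keeps only Expert 0; after renormalization its weight is 1.0.
y = x_tok @ We[int(np.argmax(gates))]        # [2.0, 2.0]
```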
Constraints