Sparse Mixture-of-Experts (MoE) architectures represent one of the most transformative paradigms in modern deep learning, enabling models to achieve unprecedented scale while maintaining computational efficiency. Unlike dense neural networks where every parameter participates in every computation, MoE models employ a conditional computation strategy—activating only a subset of specialized "experts" for each input token.
In a Sparse MoE layer, the architecture consists of two key components: a set of expert networks (typically independent feed-forward networks) and a router (gating network) that decides which experts process each token.
Modern flagship models leverage this architecture at massive scale.
The routing process operates as follows:
Step 1 - Score Computation: For each input token, the router computes raw logit scores for all N experts: $$\text{router\_logits} = W_r \cdot x$$
Step 2 - Expert Selection: Select the top-K experts based on their logit scores. Let $\mathcal{T}_k$ denote the set of indices for the top-K experts.
Step 3 - Weight Normalization: Apply softmax only over the selected experts to compute normalized routing weights: $$w_i = \frac{\exp(\text{logit}_i)}{\sum_{j \in \mathcal{T}_k} \exp(\text{logit}_j)} \quad \text{for } i \in \mathcal{T}_k$$
Step 4 - Expert Output Aggregation: Combine the outputs of selected experts using the computed weights: $$\text{output} = \sum_{i \in \mathcal{T}_k} w_i \cdot E_i(x)$$
where $E_i(x)$ represents the output of expert $i$ for input $x$.
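The four steps above can be sketched in NumPy. This is a minimal illustration, not a reference solution; the function name `moe_topk_route` and the `argsort`-based selection are assumptions, not part of the problem statement:

```python
import numpy as np

def moe_topk_route(router_logits, expert_outputs, k):
    """Top-K routing for a Sparse MoE layer (illustrative sketch).

    router_logits:  (num_tokens, num_experts)
    expert_outputs: (num_tokens, num_experts, d_model), pre-computed E_i(x)
    k:              number of experts to activate per token
    """
    logits = np.asarray(router_logits, dtype=float)
    outputs = np.asarray(expert_outputs, dtype=float)

    # Step 2: indices of the top-K experts per token (order within the K is irrelevant)
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]

    # Step 3: softmax over the selected logits only
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    topk_logits -= topk_logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(topk_logits)
    weights = exp / exp.sum(axis=-1, keepdims=True)         # (num_tokens, k)

    # Step 4: weighted sum of the selected experts' outputs
    selected = np.take_along_axis(outputs, topk_idx[:, :, None], axis=1)
    return (weights[:, :, None] * selected).sum(axis=1)     # (num_tokens, d_model)
```

Subtracting the per-token maximum before exponentiating leaves the softmax result unchanged but avoids overflow for large logits.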
Implement the complete top-K routing logic for a Sparse MoE layer. Given the router logits for each token, the pre-computed outputs from all experts, and the sparsity parameter K, your function should perform the steps above: select the top-K experts for each token, compute softmax weights over only those experts, and return the weighted combination of their outputs.
Key Insight: The softmax normalization occurs after expert selection, meaning we compute a probability distribution only over the K selected experts, not over all N experts. This ensures routing weights sum to 1.0 for each token.
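This ordering matters. A short sketch contrasting the correct order (top-K, then softmax) with the incorrect one (softmax over all N, then top-K); the values are from this problem's first example:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])
k = 2

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Correct order: select top-K first, then softmax over the K survivors.
topk = np.argsort(logits)[-k:]
w_after = softmax(logits[topk])     # sums to 1.0 (up to float rounding)

# Wrong order: softmax over all N experts, then keep the top-K entries.
w_before = softmax(logits)[topk]    # sums to only ~0.786 here

print(w_after.sum(), w_before.sum())
```

With the wrong order the kept weights no longer form a probability distribution, so the expert outputs would be systematically down-scaled unless renormalized.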
router_logits = [[2.0, 1.0, 0.5, 0.1]]
expert_outputs = [[[1, 0], [0, 1], [1, 1], [0, 0]]]
k = 2

Output: [[0.731, 0.269]]

Single Token Routing with K=2
For the single token, we have 4 experts with logits [2.0, 1.0, 0.5, 0.1].
Step 1 - Expert Selection: Top-2 experts are: • Expert 0: logit = 2.0 (highest) • Expert 1: logit = 1.0 (second highest)
Step 2 - Weight Computation: Softmax over selected logits [2.0, 1.0]: • w₀ = exp(2.0) / (exp(2.0) + exp(1.0)) = 7.389 / (7.389 + 2.718) ≈ 0.731 • w₁ = exp(1.0) / (exp(2.0) + exp(1.0)) = 2.718 / (7.389 + 2.718) ≈ 0.269
Step 3 - Output Aggregation: • Expert 0 output: [1, 0] • Expert 1 output: [0, 1] • Final = 0.731 × [1, 0] + 0.269 × [0, 1] = [0.731, 0.269]
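The arithmetic in this example can be checked directly (a small verification snippet, not part of the required solution):

```python
import numpy as np

# Top-2 logits from the example, in selection order (Expert 0, Expert 1)
logits = np.array([2.0, 1.0])
w = np.exp(logits) / np.exp(logits).sum()        # [0.731..., 0.269...]

# Weighted combination of the selected experts' outputs
final = w[0] * np.array([1, 0]) + w[1] * np.array([0, 1])
print(np.round(final, 3))                        # [0.731 0.269]
```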
router_logits = [[1.0, 2.0, 0.5], [3.0, 1.0, 0.5]]
expert_outputs = [[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[1, 1, 0], [0, 1, 1], [1, 0, 1]]]
k = 1

Output: [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

Multi-Token Routing with K=1 (Hard Routing)
With K=1, each token is routed to exactly one expert (hard routing).
Token 0: Logits = [1.0, 2.0, 0.5] • Top expert: Expert 1 (logit = 2.0) • Softmax of single value = 1.0 • Output = 1.0 × [0, 1, 0] = [0.0, 1.0, 0.0]
Token 1: Logits = [3.0, 1.0, 0.5] • Top expert: Expert 0 (logit = 3.0) • Softmax of single value = 1.0 • Output = 1.0 × [1, 1, 0] = [1.0, 1.0, 0.0]
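For K=1 the softmax over a single logit is always 1.0, so the whole routine collapses to an argmax lookup. A sketch of that special case, using this example's values:

```python
import numpy as np

logits = np.array([[1.0, 2.0, 0.5], [3.0, 1.0, 0.5]])
outputs = np.array([[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                    [[1, 1, 0], [0, 1, 1], [1, 0, 1]]], dtype=float)

# Hard routing: each token simply copies its best expert's output.
best = logits.argmax(axis=-1)                  # [1, 0]
out = outputs[np.arange(len(best)), best]      # [[0,1,0], [1,1,0]]
print(out)
```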
router_logits = [[1.5, 2.5, 3.5, 0.5], [4.0, 3.0, 2.0, 1.0]]
expert_outputs = [[[1, 0], [0, 1], [1, 1], [0, 0]], [[2, 1], [1, 2], [0, 1], [1, 0]]]
k = 3

Output: [[0.755, 0.91], [1.575, 1.245]]

Multi-Token Routing with K=3
Token 0: Logits = [1.5, 2.5, 3.5, 0.5] • Top-3 experts: Expert 2 (3.5), Expert 1 (2.5), Expert 0 (1.5) • Softmax over [3.5, 2.5, 1.5]: w₂ ≈ 0.665, w₁ ≈ 0.245, w₀ ≈ 0.090 • Final = 0.665 × [1, 1] + 0.245 × [0, 1] + 0.090 × [1, 0] ≈ [0.755, 0.91]
Token 1: Logits = [4.0, 3.0, 2.0, 1.0] • Top-3 experts: Expert 0 (4.0), Expert 1 (3.0), Expert 2 (2.0) • Softmax over [4.0, 3.0, 2.0]:
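Both tokens of this example can be verified in one vectorized pass (a check snippet; the variable names are illustrative):

```python
import numpy as np

logits = np.array([[1.5, 2.5, 3.5, 0.5], [4.0, 3.0, 2.0, 1.0]])
outputs = np.array([[[1, 0], [0, 1], [1, 1], [0, 0]],
                    [[2, 1], [1, 2], [0, 1], [1, 0]]], dtype=float)
k = 3

idx = np.argsort(logits, axis=-1)[:, -k:]                   # top-3 per token
sel = np.take_along_axis(logits, idx, axis=-1)
w = np.exp(sel) / np.exp(sel).sum(axis=-1, keepdims=True)   # softmax over the 3
out = (w[:, :, None] *
       np.take_along_axis(outputs, idx[:, :, None], axis=1)).sum(axis=1)
print(np.round(out, 3))   # ≈ [[0.755, 0.91], [1.575, 1.245]]
```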
Constraints