In reinforcement learning for large language models (LLMs), policy gradient methods are used to fine-tune models based on reward signals. However, standard implementations often suffer from subtle biases that can degrade training stability and final model quality.
This problem asks you to implement an unbiased policy gradient objective function that corrects two critical sources of bias:
Bias 1: Response-Level Length Bias
Traditional approaches normalize the objective by the number of tokens in each response (dividing by |o_i|). This creates an unfair advantage for shorter responses, as they contribute disproportionately to the overall gradient compared to longer, potentially more thorough responses.

Bias 2: Question-Level Difficulty Bias
Normalizing advantages by their standard deviation (σ) causes questions with naturally lower reward variance to dominate training. This "difficulty bias" means easier questions receive outsized influence on model updates.
The Unbiased Approach: Instead of these normalizations, we use a simpler, unbiased advantage formulation: subtract only the mean reward (baseline) from each reward, without any variance scaling. We then compute token-level clipped importance ratios and sum (not average) them over all tokens.
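To make the difficulty bias concrete, the following sketch (with made-up reward values) contrasts the mean-only baseline with σ-normalized advantages: after σ-normalization, a low-variance "easy" question produces advantages of the same magnitude as a high-variance "hard" one.

```python
import statistics

# Hypothetical reward groups for two questions (illustrative values only)
rewards_easy = [0.9, 1.0, 1.1]   # low reward variance ("easy" question)
rewards_hard = [0.0, 1.0, 2.0]   # high reward variance ("hard" question)

def mean_only(rs):
    """Unbiased advantages: subtract the mean reward, no variance scaling."""
    mu = statistics.mean(rs)
    return [r - mu for r in rs]

def sigma_normalized(rs):
    """Sigma-scaled advantages: also divide by the standard deviation."""
    mu, sd = statistics.mean(rs), statistics.pstdev(rs)
    return [(r - mu) / sd for r in rs]

# Mean-only preserves the scale difference between the two questions;
# sigma-normalization maps both groups onto the same values (~±1.22),
# giving the low-variance question equal pull on the update.
print(mean_only(rewards_easy), mean_only(rewards_hard))
print(sigma_normalized(rewards_easy), sigma_normalized(rewards_hard))
```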
Mathematical Formulation:
Compute Unbiased Advantages: $$\hat{A}_i = r_i - \bar{r}$$ where $\bar{r}$ is the mean reward across all responses.
Compute Token-Level Importance Ratios: For each token $t$ in response $i$: $$\rho_{i,t} = \exp(\log \pi_{\theta}(a_{i,t}) - \log \pi_{\theta_{old}}(a_{i,t}))$$
Apply PPO-Style Clipping: $$L_{i,t} = \min(\rho_{i,t} \cdot \hat{A}_i, \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \cdot \hat{A}_i)$$
Sum Over Tokens (Not Average): $$L_i = \sum_{t=1}^{|o_i|} L_{i,t}$$
Average Over Responses: $$L = \frac{1}{G} \sum_{i=1}^{G} L_i$$ where $G$ is the number of responses.
Your Task: Implement a function that computes this unbiased policy gradient objective given the log probabilities from both the new and old policies, the reward values, and the clipping parameter epsilon.
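The five steps above can be sketched in plain Python. This is a minimal sketch: the function name `unbiased_pg_objective` and the list-of-lists inputs are assumptions for illustration, and a real implementation would operate on batched tensors.

```python
import math

def unbiased_pg_objective(log_probs_new, log_probs_old, rewards, epsilon):
    """Unbiased policy gradient objective: mean-only baseline,
    token-level clipped ratios summed per response, averaged over responses."""
    G = len(rewards)
    mean_reward = sum(rewards) / G
    total = 0.0
    for lp_new, lp_old, r in zip(log_probs_new, log_probs_old, rewards):
        advantage = r - mean_reward          # no sigma scaling (avoids difficulty bias)
        response_sum = 0.0
        for ln, lo in zip(lp_new, lp_old):
            rho = math.exp(ln - lo)          # token-level importance ratio
            clipped = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
            response_sum += min(rho * advantage, clipped * advantage)
        total += response_sum                # sum over tokens (avoids length bias)
    return total / G                         # average over responses
```

Because the per-token terms are summed rather than averaged, every token contributes equally to the objective regardless of how long its response is.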
log_probs_new = [[-0.2, -0.3], [-0.1, -0.4]]
log_probs_old = [[-0.5, -0.6], [-0.4, -0.7]]
rewards = [1.0, 0.0]
epsilon = 0.2

Step 1: Compute Unbiased Advantages
Mean reward: μ = (1.0 + 0.0) / 2 = 0.5
Advantage for response 1: Â₁ = 1.0 - 0.5 = 0.5
Advantage for response 2: Â₂ = 0.0 - 0.5 = -0.5
Step 2: Compute Importance Ratios and Clipped Objectives
Response 1 (Â₁ = 0.5): • Token 1: ρ = exp(-0.2 - (-0.5)) = exp(0.3) ≈ 1.35; clip(ρ, 0.8, 1.2) = 1.2, so L = min(1.35 × 0.5, 1.2 × 0.5) = min(0.675, 0.6) = 0.6
• Token 2: ρ = exp(-0.3 - (-0.6)) = exp(0.3) ≈ 1.35; again L = 0.6
Response 1 total: L₁ = 0.6 + 0.6 = 1.2
Response 2 (Â₂ = -0.5): • Token 1: ρ = exp(-0.1 - (-0.4)) = exp(0.3) ≈ 1.35; here the unclipped term is smaller: L = min(1.35 × (-0.5), 1.2 × (-0.5)) = min(-0.675, -0.6) = -0.675
• Token 2: ρ = exp(-0.4 - (-0.7)) = exp(0.3) ≈ 1.35; again L = -0.675
Response 2 total: L₂ = -0.675 + (-0.675) = -1.35
Step 3: Average Over Responses L = (1.2 + (-1.35)) / 2 = -0.15 / 2 = -0.075
Using the full-precision ratio exp(0.3) ≈ 1.34986, the objective evaluates to approximately -0.0749.
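The walkthrough arithmetic can be checked directly. This self-contained sketch recomputes Example 1 at full precision, exploiting the fact that every token in the example has the same importance ratio exp(0.3):

```python
import math

eps = 0.2
rho = math.exp(0.3)                # every token ratio in this example, ~1.34986
adv1, adv2 = 0.5, -0.5             # mean-only advantages from Step 1

clipped = min(max(rho, 1.0 - eps), 1.0 + eps)    # clip(rho, 0.8, 1.2) = 1.2
tok1 = min(rho * adv1, clipped * adv1)           # clipped branch wins: 0.6
tok2 = min(rho * adv2, clipped * adv2)           # unclipped branch wins: ~ -0.675

L = ((tok1 + tok1) + (tok2 + tok2)) / 2          # sum over tokens, average responses
print(round(L, 4))   # → -0.0749
```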
log_probs_new = [[-0.5, -0.5], [-0.5, -0.5]]
log_probs_old = [[-0.5, -0.5], [-0.5, -0.5]]
rewards = [1.0, 1.0]
epsilon = 0.2

Expected output: 0.0

Step 1: Compute Unbiased Advantages
Mean reward: μ = (1.0 + 1.0) / 2 = 1.0
Advantage for response 1: Â₁ = 1.0 - 1.0 = 0.0
Advantage for response 2: Â₂ = 1.0 - 1.0 = 0.0
Since both advantages are exactly zero, any importance ratio multiplied by zero yields zero.
Step 2: All Token Objectives Every token contributes: L_{i,t} = ρ_{i,t} × 0.0 = 0.0
Step 3: Final Objective L = (0.0 + 0.0) / 2 = 0.0
This case demonstrates that when all responses receive identical rewards, the objective produces no gradient signal—the policy has no information about which responses are better.
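The zero-signal behavior is easy to confirm: with identical rewards the advantages are exactly zero, so every clipped term vanishes no matter what the ratios are. A quick self-contained check:

```python
import math

log_probs_new = [[-0.5, -0.5], [-0.5, -0.5]]
log_probs_old = [[-0.5, -0.5], [-0.5, -0.5]]
rewards = [1.0, 1.0]
eps = 0.2

mu = sum(rewards) / len(rewards)
advantages = [r - mu for r in rewards]            # [0.0, 0.0]

total = 0.0
for lp_n, lp_o, adv in zip(log_probs_new, log_probs_old, advantages):
    for ln, lo in zip(lp_n, lp_o):
        rho = math.exp(ln - lo)                   # 1.0 here, but irrelevant
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        total += min(rho * adv, clipped * adv)    # every term is ratio × 0.0
print(total / len(rewards))   # → 0.0
```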
log_probs_new = [[-0.3, -0.4, -0.5]]
log_probs_old = [[-0.5, -0.6, -0.7]]
rewards = [2.0]
epsilon = 0.1

Expected output: 0.0

Step 1: Compute Unbiased Advantages
With only one response (G = 1), the mean equals the single reward.
Mean reward: μ = 2.0 / 1 = 2.0
Advantage: Â₁ = 2.0 - 2.0 = 0.0
Step 2: Token Objectives Since the advantage is zero, all token contributions are zero regardless of the importance ratios.
Step 3: Final Objective L = 0.0 / 1 = 0.0
This edge case shows that a single sample provides no learning signal under a mean baseline: you need multiple responses with varying rewards to compute meaningful advantages. In practice, RL pipelines typically sample several responses per question to ensure meaningful advantage estimates.
Constraints