In reinforcement learning for large language models (LLMs), policy gradient methods are used to fine-tune models based on reward signals. However, standard implementations often suffer from subtle biases that can degrade training stability and final model quality.
This problem asks you to implement an unbiased policy gradient objective function that corrects two critical sources of bias:
Bias 1: Response-Level Length Bias
Traditional approaches normalize the objective by the number of tokens in each response (dividing by |o_i|). This creates an unfair advantage for shorter responses, as they contribute disproportionately to the overall gradient compared to longer, potentially more thorough responses.

Bias 2: Question-Level Difficulty Bias
Normalizing advantages by their standard deviation (σ) causes questions with naturally lower reward variance to dominate training. This "difficulty bias" means easier questions receive outsized influence on model updates.
The Unbiased Approach: Instead of these normalizations, we use a simpler, unbiased advantage formulation: subtract only the mean reward (baseline) from each reward, without any variance scaling. We then compute token-level clipped importance ratios and sum (not average) them over all tokens.
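To make the difficulty bias concrete, the following sketch (with made-up reward values) contrasts the mean-only baseline with σ-normalized advantages: after σ-normalization, a low-variance "easy" question produces advantages of the same magnitude as a high-variance "hard" one.

```python
import statistics

# Hypothetical reward groups for two questions (illustrative values only)
rewards_easy = [0.9, 1.0, 1.1]   # low reward variance ("easy" question)
rewards_hard = [0.0, 1.0, 2.0]   # high reward variance ("hard" question)

def mean_only(rs):
    """Unbiased advantages: subtract the mean reward, no variance scaling."""
    mu = statistics.mean(rs)
    return [r - mu for r in rs]

def sigma_normalized(rs):
    """Sigma-scaled advantages: also divide by the standard deviation."""
    mu, sd = statistics.mean(rs), statistics.pstdev(rs)
    return [(r - mu) / sd for r in rs]

# Mean-only preserves the scale difference between the two questions;
# sigma-normalization maps both groups onto the same values (~±1.22),
# giving the low-variance question equal pull on the update.
print(mean_only(rewards_easy), mean_only(rewards_hard))
print(sigma_normalized(rewards_easy), sigma_normalized(rewards_hard))
```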
Mathematical Formulation:
Compute Unbiased Advantages: $$\hat{A}_i = r_i - \bar{r}$$ where $\bar{r}$ is the mean reward across all responses.
Compute Token-Level Importance Ratios: For each token $t$ in response $i$: $$\rho_{i,t} = \exp(\log \pi_{\theta}(a_{i,t}) - \log \pi_{\theta_{old}}(a_{i,t}))$$
Apply PPO-Style Clipping: $$L_{i,t} = \min(\rho_{i,t} \cdot \hat{A}_i, \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \cdot \hat{A}_i)$$
Sum Over Tokens (Not Average): $$L_i = \sum_{t=1}^{|o_i|} L_{i,t}$$
Average Over Responses: $$L = \frac{1}{G} \sum_{i=1}^{G} L_i$$ where $G$ is the number of responses.
Your Task: Implement a function that computes this unbiased policy gradient objective given the log probabilities from both the new and old policies, the reward values, and the clipping parameter epsilon.
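The five steps above can be sketched in plain Python. This is a minimal sketch: the function name `unbiased_pg_objective` and the list-of-lists inputs are assumptions for illustration, and a real implementation would operate on batched tensors.

```python
import math

def unbiased_pg_objective(log_probs_new, log_probs_old, rewards, epsilon):
    """Unbiased policy gradient objective: mean-only baseline,
    token-level clipped ratios summed per response, averaged over responses."""
    G = len(rewards)
    mean_reward = sum(rewards) / G
    total = 0.0
    for lp_new, lp_old, r in zip(log_probs_new, log_probs_old, rewards):
        advantage = r - mean_reward          # no sigma scaling (avoids difficulty bias)
        response_sum = 0.0
        for ln, lo in zip(lp_new, lp_old):
            rho = math.exp(ln - lo)          # token-level importance ratio
            clipped = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
            response_sum += min(rho * advantage, clipped * advantage)
        total += response_sum                # sum over tokens (avoids length bias)
    return total / G                         # average over responses
```

Because the per-token terms are summed rather than averaged, every token contributes equally to the objective regardless of how long its response is.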
log_probs_new = [[-0.2, -0.3], [-0.1, -0.4]]
log_probs_old = [[-0.5, -0.6], [-0.4, -0.7]]
rewards = [1.0, 0.0]
epsilon = 0.2

Step 1: Compute Unbiased Advantages
Mean reward: μ = (1.0 + 0.0) / 2 = 0.5
Advantage for response 1: Â₁ = 1.0 - 0.5 = 0.5
Advantage for response 2: Â₂ = 0.0 - 0.5 = -0.5
Step 2: Compute Importance Ratios and Clipped Objectives
Response 1 (Â₁ = 0.5): • Token 1: ρ = exp(-0.2 - (-0.5)) = exp(0.3) ≈ 1.35; clip(ρ, 0.8, 1.2) = 1.2, so L = min(1.35 × 0.5, 1.2 × 0.5) = min(0.675, 0.6) = 0.6
• Token 2: ρ = exp(-0.3 - (-0.6)) = exp(0.3) ≈ 1.35; again L = 0.6
Response 1 total: L₁ = 0.6 + 0.6 = 1.2
Response 2 (Â₂ = -0.5): • Token 1: ρ = exp(-0.1 - (-0.4)) = exp(0.3) ≈ 1.35; here the unclipped term is smaller: L = min(1.35 × (-0.5), 1.2 × (-0.5)) = min(-0.675, -0.6) = -0.675
• Token 2: ρ = exp(-0.4 - (-0.7)) = exp(0.3) ≈ 1.35; again L = -0.675
Response 2 total: L₂ = -0.675 + (-0.675) = -1.35
Step 3: Average Over Responses L = (1.2 + (-1.35)) / 2 = -0.15 / 2 = -0.075
Using the full-precision ratio exp(0.3) ≈ 1.34986, the objective evaluates to approximately -0.0749.
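The walkthrough arithmetic can be checked directly. This self-contained sketch recomputes Example 1 at full precision, exploiting the fact that every token in the example has the same importance ratio exp(0.3):

```python
import math

eps = 0.2
rho = math.exp(0.3)                # every token ratio in this example, ~1.34986
adv1, adv2 = 0.5, -0.5             # mean-only advantages from Step 1

clipped = min(max(rho, 1.0 - eps), 1.0 + eps)    # clip(rho, 0.8, 1.2) = 1.2
tok1 = min(rho * adv1, clipped * adv1)           # clipped branch wins: 0.6
tok2 = min(rho * adv2, clipped * adv2)           # unclipped branch wins: ~ -0.675

L = ((tok1 + tok1) + (tok2 + tok2)) / 2          # sum over tokens, average responses
print(round(L, 4))   # → -0.0749
```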
log_probs_new = [[-0.5, -0.5], [-0.5, -0.5]]
log_probs_old = [[-0.5, -0.5], [-0.5, -0.5]]
rewards = [1.0, 1.0]
epsilon = 0.2

Expected output: 0.0

Step 1: Compute Unbiased Advantages
Mean reward: μ = (1.0 + 1.0) / 2 = 1.0
Advantage for response 1: Â₁ = 1.0 - 1.0 = 0.0
Advantage for response 2: Â₂ = 1.0 - 1.0 = 0.0
Since both advantages are exactly zero, any importance ratio multiplied by zero yields zero.
Step 2: All Token Objectives Every token contributes: L_{i,t} = ρ_{i,t} × 0.0 = 0.0
Step 3: Final Objective L = (0.0 + 0.0) / 2 = 0.0
This case demonstrates that when all responses receive identical rewards, the objective produces no gradient signal—the policy has no information about which responses are better.
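The zero-signal behavior is easy to confirm: with identical rewards the advantages are exactly zero, so every clipped term vanishes no matter what the ratios are. A quick self-contained check:

```python
import math

log_probs_new = [[-0.5, -0.5], [-0.5, -0.5]]
log_probs_old = [[-0.5, -0.5], [-0.5, -0.5]]
rewards = [1.0, 1.0]
eps = 0.2

mu = sum(rewards) / len(rewards)
advantages = [r - mu for r in rewards]            # [0.0, 0.0]

total = 0.0
for lp_n, lp_o, adv in zip(log_probs_new, log_probs_old, advantages):
    for ln, lo in zip(lp_n, lp_o):
        rho = math.exp(ln - lo)                   # 1.0 here, but irrelevant
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        total += min(rho * adv, clipped * adv)    # every term is ratio × 0.0
print(total / len(rewards))   # → 0.0
```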
log_probs_new = [[-0.3, -0.4, -0.5]]
log_probs_old = [[-0.5, -0.6, -0.7]]
rewards = [2.0]
epsilon = 0.1

Expected output: 0.0

Step 1: Compute Unbiased Advantages
With only one response (G = 1), the mean equals the single reward.
Mean reward: μ = 2.0 / 1 = 2.0
Advantage: Â₁ = 2.0 - 2.0 = 0.0
Step 2: Token Objectives Since the advantage is zero, all token contributions are zero regardless of the importance ratios.
Step 3: Final Objective L = 0.0 / 1 = 0.0
This edge case shows that a single sample provides no learning signal under a mean baseline: you need multiple responses with varying rewards to compute meaningful advantages. In practice, RL pipelines typically sample several responses per question to ensure meaningful advantage estimates.
Constraints