In modern reinforcement learning approaches for fine-tuning large language models (LLMs), policy gradient methods are essential for aligning model outputs with human preferences. However, traditional token-level optimization techniques often suffer from severe training instability and can lead to catastrophic model collapse, where the model's outputs degenerate rapidly during training.
The Sequence-Level Policy Gradient Objective addresses these challenges by operating at the sequence granularity rather than the token level. This approach computes importance sampling ratios that measure how much the current policy has diverged from the old (behavior) policy, normalizes these ratios by sequence length so that longer sequences do not produce disproportionately extreme ratios, and applies clipping to constrain policy updates within safe bounds.
Given a group of N sequences, the algorithm proceeds as follows:
First, compute the standardized advantages from the raw rewards:
$$A_i = \frac{r_i - \mu_r}{\sigma_r}$$
where:
- $r_i$ is the raw reward for sequence $i$,
- $\mu_r$ is the mean reward across the group, and
- $\sigma_r$ is the (population) standard deviation of the group's rewards.

If $\sigma_r = 0$ (all rewards are identical), the advantages should be set to zero.
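This standardization step can be sketched with NumPy (the function name and the `eps` tolerance are illustrative choices):

```python
import numpy as np

def standardized_advantages(rewards, eps=1e-8):
    """Standardize raw rewards: A_i = (r_i - mean) / std.

    Uses the population standard deviation (divide by N), as in the
    worked examples below, and returns all zeros when the rewards are
    identical (sigma = 0).
    """
    r = np.asarray(rewards, dtype=np.float64)
    sigma = r.std()  # population std by default
    if sigma < eps:
        return np.zeros_like(r)
    return (r - r.mean()) / sigma
```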
For each sequence $i$ with $T_i$ tokens, compute the sequence-level importance ratio:
$$\rho_i = \exp\left(\frac{1}{T_i}\sum_{t=1}^{T_i}\left(\log\pi_{\theta}(a_t|s_t) - \log\pi_{\text{old}}(a_t|s_t)\right)\right)$$
This can be rewritten as: $$\rho_i = \exp\left(\frac{\sum_{t=1}^{T_i}\log\pi_{\theta}(a_t|s_t) - \sum_{t=1}^{T_i}\log\pi_{\text{old}}(a_t|s_t)}{T_i}\right)$$
The length normalization ($\div T_i$) ensures that longer sequences don't have disproportionately large importance ratios.
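A minimal sketch of this ratio computation, assuming each sequence is given as a flat list of per-token log probabilities:

```python
import math

def sequence_importance_ratio(log_probs_new, log_probs_old):
    """Length-normalized sequence-level importance ratio rho_i."""
    T = len(log_probs_new)  # number of tokens in this sequence
    log_ratio = sum(n - o for n, o in zip(log_probs_new, log_probs_old))
    return math.exp(log_ratio / T)
```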
Apply the clipping function to bound the importance ratio:
$$\rho_i^{\text{clip}} = \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)$$
The per-sequence objective uses the pessimistic bound (minimum of clipped and unclipped):
$$L_i = \min(\rho_i \cdot A_i, \rho_i^{\text{clip}} \cdot A_i)$$
The final objective is the mean across all sequences:
$$L = \frac{1}{N}\sum_{i=1}^{N} L_i$$
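Given the ratios and advantages, the clipping, pessimistic bound, and mean can be sketched in a few lines of plain Python (the function name is illustrative):

```python
def clipped_sequence_objective(ratios, advantages, epsilon):
    """Clip each ratio, take the pessimistic per-sequence bound, average."""
    total = 0.0
    for rho, a in zip(ratios, advantages):
        # clip(rho, 1 - eps, 1 + eps)
        rho_clip = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
        # pessimistic bound: min of clipped and unclipped terms
        total += min(rho * a, rho_clip * a)
    return total / len(ratios)
```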
Your Task: Implement a function that computes this sequence-level policy gradient objective given the log probabilities from new and old policies, sequence rewards, and the clipping parameter epsilon.
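One possible reference sketch of the full computation in plain Python (the function name is illustrative; sequences may differ in length):

```python
import math

def sequence_level_pg_objective(log_probs_new, log_probs_old, rewards, epsilon):
    """Compute the sequence-level policy gradient objective."""
    n = len(rewards)

    # Step 1: standardized advantages (population std; zeros if sigma == 0)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    adv = [0.0] * n if sigma == 0 else [(r - mu) / sigma for r in rewards]

    total = 0.0
    for new, old, a in zip(log_probs_new, log_probs_old, adv):
        # Step 2: length-normalized importance ratio
        rho = math.exp(sum(p - q for p, q in zip(new, old)) / len(new))
        # Step 3: clip to [1 - epsilon, 1 + epsilon]
        rho_clip = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
        # Step 4: pessimistic per-sequence objective
        total += min(rho * a, rho_clip * a)
    # Step 5: mean across sequences
    return total / n
```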
log_probs_new = [[1.0, 0.5], [0.8, 1.2]]
log_probs_old = [[0.0, 0.0], [0.0, 0.0]]
rewards = [0.9, 0.7]
epsilon = 0.2
Expected output: -0.7591
Step 1: Compute Advantages
Mean reward μ = (0.9 + 0.7) / 2 = 0.8
Std deviation σ = sqrt(((0.9-0.8)² + (0.7-0.8)²) / 2) = sqrt(0.01) = 0.1
Advantages: A₁ = (0.9 - 0.8) / 0.1 = 1.0, A₂ = (0.7 - 0.8) / 0.1 = -1.0
Step 2: Compute Importance Ratios
Sequence 1: log_ratio_sum = (1.0 - 0.0) + (0.5 - 0.0) = 1.5, length = 2, ρ₁ = exp(1.5 / 2) = exp(0.75) ≈ 2.117
Sequence 2: log_ratio_sum = (0.8 - 0.0) + (1.2 - 0.0) = 2.0, length = 2, ρ₂ = exp(2.0 / 2) = exp(1.0) ≈ 2.718
Step 3: Clip Ratios
Clipping range: [1 - 0.2, 1 + 0.2] = [0.8, 1.2]
ρ₁_clipped = clip(2.117, 0.8, 1.2) = 1.2
ρ₂_clipped = clip(2.718, 0.8, 1.2) = 1.2
Step 4: Compute Per-Sequence Objectives
L₁ = min(2.117 × 1.0, 1.2 × 1.0) = min(2.117, 1.2) = 1.2
L₂ = min(2.718 × (-1.0), 1.2 × (-1.0)) = min(-2.718, -1.2) = -2.718
Step 5: Final Objective
L = (1.2 + (-2.718)) / 2 ≈ -0.759
The clipping prevents the positive advantage sequence from contributing too much, while the negative advantage sequence uses the unclipped (more pessimistic) ratio.
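The walkthrough above can be checked end to end with a short standalone script:

```python
import math

# Example inputs from above
log_probs_new = [[1.0, 0.5], [0.8, 1.2]]
log_probs_old = [[0.0, 0.0], [0.0, 0.0]]
rewards = [0.9, 0.7]
epsilon = 0.2

# Step 1: standardized advantages (population std)
mu = sum(rewards) / len(rewards)
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
adv = [(r - mu) / sigma for r in rewards]  # [1.0, -1.0]

# Steps 2-5: ratios, clipping, pessimistic bound, mean
objective = 0.0
for new, old, a in zip(log_probs_new, log_probs_old, adv):
    rho = math.exp(sum(n - o for n, o in zip(new, old)) / len(new))
    rho_clip = min(max(rho, 1 - epsilon), 1 + epsilon)
    objective += min(rho * a, rho_clip * a)
objective /= len(rewards)
# objective ≈ -0.7591
```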
log_probs_new = [[0.5, 0.5], [0.5, 0.5]]
log_probs_old = [[0.5, 0.5], [0.5, 0.5]]
rewards = [1.0, -1.0]
epsilon = 0.2
Expected output: 0.0
When the new and old policies have identical log probabilities, the importance ratios equal 1.0 for all sequences.
Step 1: Compute Advantages
Mean reward μ = (1.0 + (-1.0)) / 2 = 0.0
Std deviation σ = sqrt(((1.0-0.0)² + ((-1.0)-0.0)²) / 2) = sqrt(1.0) = 1.0
Advantages: A₁ = 1.0 / 1.0 = 1.0, A₂ = -1.0 / 1.0 = -1.0
Step 2: Compute Importance Ratios
For both sequences, log_ratio_sum = 0, so ρ = exp(0) = 1.0
Step 3: No Clipping Needed
Since ρ = 1.0 is within [0.8, 1.2], ρ_clipped = 1.0
Step 4: Compute Objectives
L₁ = min(1.0 × 1.0, 1.0 × 1.0) = 1.0
L₂ = min(1.0 × (-1.0), 1.0 × (-1.0)) = -1.0
Step 5: Final Objective
L = (1.0 + (-1.0)) / 2 = 0.0
The symmetric rewards and identical policies result in a zero objective.
log_probs_new = [[-0.5, -0.3], [-0.2, -0.4], [-0.1, -0.6]]
log_probs_old = [[-0.6, -0.4], [-0.3, -0.5], [-0.2, -0.7]]
rewards = [0.8, 0.5, 0.3]
epsilon = 0.1
Expected output: 0.0
This example demonstrates small policy differences with a tighter clipping threshold (ε = 0.1).
Step 1: Compute Advantages
Mean reward μ = (0.8 + 0.5 + 0.3) / 3 ≈ 0.533
Variance = ((0.8-0.533)² + (0.5-0.533)² + (0.3-0.533)²) / 3 ≈ 0.0422
Std deviation σ ≈ 0.205
Advantages: A₁ ≈ 1.298, A₂ ≈ -0.162, A₃ ≈ -1.136
Step 2: Compute Importance Ratios
Seq 1: log_ratio = (-0.5+0.6) + (-0.3+0.4) = 0.2, ρ₁ = exp(0.2/2) = exp(0.1) ≈ 1.105
Seq 2: log_ratio = (-0.2+0.3) + (-0.4+0.5) = 0.2, ρ₂ = exp(0.2/2) ≈ 1.105
Seq 3: log_ratio = (-0.1+0.2) + (-0.6+0.7) = 0.2, ρ₃ = exp(0.2/2) ≈ 1.105
Step 3: Clip with ε = 0.1
Clipping range: [0.9, 1.1]
All ratios ≈ 1.105, clipped to 1.1
Step 4: Compute Pessimistic Objectives
Since ρ ≈ 1.105 exceeds the upper bound, the positive-advantage sequence uses the clipped ratio, while the negative-advantage sequences use the unclipped (more pessimistic) ratio:
L₁ = min(1.105 × 1.298, 1.1 × 1.298) = 1.1 × 1.298 ≈ 1.428
L₂ = min(1.105 × (-0.162), 1.1 × (-0.162)) ≈ -0.179
L₃ = min(1.105 × (-1.136), 1.1 × (-1.136)) ≈ -1.255
Step 5: Final Objective
L = (1.428 + (-0.179) + (-1.255)) / 3 ≈ -0.002 (≈ 0.0)
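The arithmetic for this example can be checked end to end with a short standalone script:

```python
import math

# Example inputs from above
log_probs_new = [[-0.5, -0.3], [-0.2, -0.4], [-0.1, -0.6]]
log_probs_old = [[-0.6, -0.4], [-0.3, -0.5], [-0.2, -0.7]]
rewards = [0.8, 0.5, 0.3]
epsilon = 0.1

# Step 1: standardized advantages (population std)
mu = sum(rewards) / len(rewards)
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
adv = [(r - mu) / sigma for r in rewards]

# Steps 2-5: ratios, clipping, pessimistic bound, mean
objective = 0.0
for new, old, a in zip(log_probs_new, log_probs_old, adv):
    rho = math.exp(sum(n - o for n, o in zip(new, old)) / len(new))
    rho_clip = min(max(rho, 1 - epsilon), 1 + epsilon)
    objective += min(rho * a, rho_clip * a)
objective /= len(rewards)
# objective ≈ -0.0022, i.e. 0.0 at one decimal place
```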
Constraints