In modern reinforcement learning approaches for fine-tuning large language models (LLMs), policy gradient methods are essential for aligning model outputs with human preferences. However, traditional token-level optimization techniques often suffer from severe training instability and can lead to catastrophic model collapse, where the model's outputs degenerate rapidly during training.
The Sequence-Level Policy Gradient Objective addresses these challenges by operating at the sequence granularity rather than the token level. This approach computes importance sampling ratios that measure how much the current policy has diverged from the old (behavior) policy, normalizes these ratios by sequence length so that longer sequences do not produce disproportionately extreme ratios, and applies clipping to constrain policy updates within safe bounds.
Given a group of N sequences, the algorithm proceeds as follows:
First, compute the standardized advantages from the raw rewards:
$$A_i = \frac{r_i - \mu_r}{\sigma_r}$$
where:
- $r_i$ is the raw reward for sequence $i$,
- $\mu_r$ is the mean reward across the group, and
- $\sigma_r$ is the (population) standard deviation of the group's rewards.

If $\sigma_r = 0$ (all rewards are identical), the advantages should be set to zero.
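This standardization step can be sketched with NumPy (the function name and the `eps` tolerance are illustrative choices):

```python
import numpy as np

def standardized_advantages(rewards, eps=1e-8):
    """Standardize raw rewards: A_i = (r_i - mean) / std.

    Uses the population standard deviation (divide by N), as in the
    worked examples below, and returns all zeros when the rewards are
    identical (sigma = 0).
    """
    r = np.asarray(rewards, dtype=np.float64)
    sigma = r.std()  # population std by default
    if sigma < eps:
        return np.zeros_like(r)
    return (r - r.mean()) / sigma
```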
For each sequence $i$ with $T_i$ tokens, compute the sequence-level importance ratio:
$$\rho_i = \exp\left(\frac{1}{T_i}\sum_{t=1}^{T_i}\left(\log\pi_{\theta}(a_t|s_t) - \log\pi_{\text{old}}(a_t|s_t)\right)\right)$$
This can be rewritten as: $$\rho_i = \exp\left(\frac{\sum_{t=1}^{T_i}\log\pi_{\theta}(a_t|s_t) - \sum_{t=1}^{T_i}\log\pi_{\text{old}}(a_t|s_t)}{T_i}\right)$$
The length normalization ($\div T_i$) ensures that longer sequences don't have disproportionately large importance ratios.
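A minimal sketch of this ratio computation, assuming each sequence is given as a flat list of per-token log probabilities:

```python
import math

def sequence_importance_ratio(log_probs_new, log_probs_old):
    """Length-normalized sequence-level importance ratio rho_i."""
    T = len(log_probs_new)  # number of tokens in this sequence
    log_ratio = sum(n - o for n, o in zip(log_probs_new, log_probs_old))
    return math.exp(log_ratio / T)
```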
Apply the clipping function to bound the importance ratio:
$$\rho_i^{\text{clip}} = \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)$$
The per-sequence objective uses the pessimistic bound (minimum of clipped and unclipped):
$$L_i = \min(\rho_i \cdot A_i, \rho_i^{\text{clip}} \cdot A_i)$$
The final objective is the mean across all sequences:
$$L = \frac{1}{N}\sum_{i=1}^{N} L_i$$
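Given the ratios and advantages, the clipping, pessimistic bound, and mean can be sketched in a few lines of plain Python (the function name is illustrative):

```python
def clipped_sequence_objective(ratios, advantages, epsilon):
    """Clip each ratio, take the pessimistic per-sequence bound, average."""
    total = 0.0
    for rho, a in zip(ratios, advantages):
        # clip(rho, 1 - eps, 1 + eps)
        rho_clip = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
        # pessimistic bound: min of clipped and unclipped terms
        total += min(rho * a, rho_clip * a)
    return total / len(ratios)
```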
Your Task: Implement a function that computes this sequence-level policy gradient objective given the log probabilities from new and old policies, sequence rewards, and the clipping parameter epsilon.
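One possible reference sketch of the full computation in plain Python (the function name is illustrative; sequences may differ in length):

```python
import math

def sequence_level_pg_objective(log_probs_new, log_probs_old, rewards, epsilon):
    """Compute the sequence-level policy gradient objective."""
    n = len(rewards)

    # Step 1: standardized advantages (population std; zeros if sigma == 0)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    adv = [0.0] * n if sigma == 0 else [(r - mu) / sigma for r in rewards]

    total = 0.0
    for new, old, a in zip(log_probs_new, log_probs_old, adv):
        # Step 2: length-normalized importance ratio
        rho = math.exp(sum(p - q for p, q in zip(new, old)) / len(new))
        # Step 3: clip to [1 - epsilon, 1 + epsilon]
        rho_clip = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
        # Step 4: pessimistic per-sequence objective
        total += min(rho * a, rho_clip * a)
    # Step 5: mean across sequences
    return total / n
```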
log_probs_new = [[1.0, 0.5], [0.8, 1.2]]
log_probs_old = [[0.0, 0.0], [0.0, 0.0]]
rewards = [0.9, 0.7]
epsilon = 0.2
Expected output: -0.7591
Step 1: Compute Advantages
Mean reward μ = (0.9 + 0.7) / 2 = 0.8
Std deviation σ = sqrt(((0.9-0.8)² + (0.7-0.8)²) / 2) = sqrt(0.01) = 0.1
Advantages: A₁ = (0.9 - 0.8) / 0.1 = 1.0, A₂ = (0.7 - 0.8) / 0.1 = -1.0
Step 2: Compute Importance Ratios
Sequence 1: log_ratio_sum = (1.0 - 0.0) + (0.5 - 0.0) = 1.5, length = 2, ρ₁ = exp(1.5 / 2) = exp(0.75) ≈ 2.117
Sequence 2: log_ratio_sum = (0.8 - 0.0) + (1.2 - 0.0) = 2.0, length = 2, ρ₂ = exp(2.0 / 2) = exp(1.0) ≈ 2.718
Step 3: Clip Ratios
Clipping range: [1 - 0.2, 1 + 0.2] = [0.8, 1.2]
ρ₁_clipped = clip(2.117, 0.8, 1.2) = 1.2
ρ₂_clipped = clip(2.718, 0.8, 1.2) = 1.2
Step 4: Compute Per-Sequence Objectives
L₁ = min(2.117 × 1.0, 1.2 × 1.0) = min(2.117, 1.2) = 1.2
L₂ = min(2.718 × (-1.0), 1.2 × (-1.0)) = min(-2.718, -1.2) = -2.718
Step 5: Final Objective
L = (1.2 + (-2.718)) / 2 ≈ -0.759
The clipping prevents the positive advantage sequence from contributing too much, while the negative advantage sequence uses the unclipped (more pessimistic) ratio.
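The walkthrough above can be checked end to end with a short standalone script:

```python
import math

# Example inputs from above
log_probs_new = [[1.0, 0.5], [0.8, 1.2]]
log_probs_old = [[0.0, 0.0], [0.0, 0.0]]
rewards = [0.9, 0.7]
epsilon = 0.2

# Step 1: standardized advantages (population std)
mu = sum(rewards) / len(rewards)
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
adv = [(r - mu) / sigma for r in rewards]  # [1.0, -1.0]

# Steps 2-5: ratios, clipping, pessimistic bound, mean
objective = 0.0
for new, old, a in zip(log_probs_new, log_probs_old, adv):
    rho = math.exp(sum(n - o for n, o in zip(new, old)) / len(new))
    rho_clip = min(max(rho, 1 - epsilon), 1 + epsilon)
    objective += min(rho * a, rho_clip * a)
objective /= len(rewards)
# objective ≈ -0.7591
```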
log_probs_new = [[0.5, 0.5], [0.5, 0.5]]
log_probs_old = [[0.5, 0.5], [0.5, 0.5]]
rewards = [1.0, -1.0]
epsilon = 0.2
Expected output: 0.0
When the new and old policies have identical log probabilities, the importance ratios equal 1.0 for all sequences.
Step 1: Compute Advantages
Mean reward μ = (1.0 + (-1.0)) / 2 = 0.0
Std deviation σ = sqrt(((1.0-0.0)² + ((-1.0)-0.0)²) / 2) = sqrt(1.0) = 1.0
Advantages: A₁ = 1.0 / 1.0 = 1.0, A₂ = -1.0 / 1.0 = -1.0
Step 2: Compute Importance Ratios
For both sequences, log_ratio_sum = 0, so ρ = exp(0) = 1.0
Step 3: No Clipping Needed
Since ρ = 1.0 is within [0.8, 1.2], ρ_clipped = 1.0
Step 4: Compute Objectives
L₁ = min(1.0 × 1.0, 1.0 × 1.0) = 1.0
L₂ = min(1.0 × (-1.0), 1.0 × (-1.0)) = -1.0
Step 5: Final Objective
L = (1.0 + (-1.0)) / 2 = 0.0
The symmetric rewards and identical policies result in a zero objective.
log_probs_new = [[-0.5, -0.3], [-0.2, -0.4], [-0.1, -0.6]]
log_probs_old = [[-0.6, -0.4], [-0.3, -0.5], [-0.2, -0.7]]
rewards = [0.8, 0.5, 0.3]
epsilon = 0.1
Expected output: 0.0
This example demonstrates small policy differences with a tighter clipping threshold (ε = 0.1).
Step 1: Compute Advantages
Mean reward μ = (0.8 + 0.5 + 0.3) / 3 ≈ 0.533
Variance = ((0.8-0.533)² + (0.5-0.533)² + (0.3-0.533)²) / 3 ≈ 0.0422
Std deviation σ ≈ 0.205
Advantages: A₁ ≈ 1.298, A₂ ≈ -0.162, A₃ ≈ -1.136
Step 2: Compute Importance Ratios
Seq 1: log_ratio = (-0.5+0.6) + (-0.3+0.4) = 0.2, ρ₁ = exp(0.2/2) = exp(0.1) ≈ 1.105
Seq 2: log_ratio = (-0.2+0.3) + (-0.4+0.5) = 0.2, ρ₂ = exp(0.2/2) ≈ 1.105
Seq 3: log_ratio = (-0.1+0.2) + (-0.6+0.7) = 0.2, ρ₃ = exp(0.2/2) ≈ 1.105
Step 3: Clip with ε = 0.1
Clipping range: [0.9, 1.1]
All ratios ≈ 1.105, clipped to 1.1
Step 4: Compute Pessimistic Objectives
Since ρ ≈ 1.105 exceeds the upper bound, the positive-advantage sequence uses the clipped ratio, while the negative-advantage sequences use the unclipped (more pessimistic) ratio:
L₁ = min(1.105 × 1.298, 1.1 × 1.298) = 1.1 × 1.298 ≈ 1.428
L₂ = min(1.105 × (-0.162), 1.1 × (-0.162)) ≈ -0.179
L₃ = min(1.105 × (-1.136), 1.1 × (-1.136)) ≈ -1.255
Step 5: Final Objective
L = (1.428 + (-0.179) + (-1.255)) / 3 ≈ -0.002 (≈ 0.0)
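The arithmetic for this example can be checked end to end with a short standalone script:

```python
import math

# Example inputs from above
log_probs_new = [[-0.5, -0.3], [-0.2, -0.4], [-0.1, -0.6]]
log_probs_old = [[-0.6, -0.4], [-0.3, -0.5], [-0.2, -0.7]]
rewards = [0.8, 0.5, 0.3]
epsilon = 0.1

# Step 1: standardized advantages (population std)
mu = sum(rewards) / len(rewards)
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
adv = [(r - mu) / sigma for r in rewards]

# Steps 2-5: ratios, clipping, pessimistic bound, mean
objective = 0.0
for new, old, a in zip(log_probs_new, log_probs_old, adv):
    rho = math.exp(sum(n - o for n, o in zip(new, old)) / len(new))
    rho_clip = min(max(rho, 1 - epsilon), 1 + epsilon)
    objective += min(rho * a, rho_clip * a)
objective /= len(rewards)
# objective ≈ -0.0022, i.e. 0.0 at one decimal place
```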
Constraints