In modern reinforcement learning systems designed for fine-tuning large language models, a crucial technique involves relative advantage estimation within generation groups. When optimizing a language model's policy, instead of using raw reward signals directly, we compute how each generated output compares to other outputs for the same input prompt.
This technique is fundamental to Group Relative Policy Optimization (GRPO), a training paradigm where the model generates multiple candidate outputs (a "group") for each prompt, and each output's advantage is computed relative to the group's performance. This relative scoring mechanism ensures that policy gradients push the model toward outputs that are better than the group average, rather than simply toward outputs with high absolute rewards.
The Normalization Process:
Given a group of G outputs with corresponding reward scores, the advantage for each output is computed using z-score normalization:
$$A_i = \frac{r_i - \mu}{\sigma}$$
Where:
• $A_i$ is the normalized advantage of output $i$
• $r_i$ is the reward of output $i$
• $\mu$ is the mean reward of the group
• $\sigma$ is the standard deviation of the group's rewards
This normalization transforms rewards into a distribution centered at zero, where outputs scoring above the group mean receive positive advantages and outputs scoring below it receive negative advantages.
Why This Matters:
Raw reward values can vary significantly across different prompts, making it difficult to combine gradients from different training examples. By normalizing within each group, we ensure that every prompt contributes equally to the training signal, regardless of the absolute reward scale. This prevents prompts with larger reward ranges from dominating the optimization process.
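The scale-equalizing effect can be sketched with two hypothetical groups whose rewards differ only by a factor of ten (the `z_norm` helper and group values below are illustrative, not part of the task):

```python
def z_norm(rs):
    """Z-score normalize a group of rewards (population statistics)."""
    mu = sum(rs) / len(rs)
    sigma = (sum((r - mu) ** 2 for r in rs) / len(rs)) ** 0.5
    return [(r - mu) / sigma for r in rs]

# Two prompts with very different raw reward scales
group_a = [1.0, 2.0, 3.0]     # small-scale rewards
group_b = [10.0, 20.0, 30.0]  # 10x larger scale

# After per-group normalization both groups yield the same advantages,
# so neither prompt dominates the gradient merely due to reward scale.
norm_a, norm_b = z_norm(group_a), z_norm(group_b)
```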
Your Task: Write a function that takes a list of reward values for a group of model outputs and returns the normalized advantage scores. Round each result to 4 decimal places for numerical precision.
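A minimal sketch of such a function, assuming the population standard deviation (dividing by G, as the worked examples below do) and that the group's rewards are not all identical (σ = 0 would divide by zero); the name `compute_advantages` is illustrative:

```python
def compute_advantages(rewards):
    """Z-score normalize a group's rewards into advantages, rounded to 4 dp."""
    g = len(rewards)
    mu = sum(rewards) / g
    # Population variance: divide by G, matching the worked examples
    sigma = (sum((r - mu) ** 2 for r in rewards) / g) ** 0.5
    # Assumes sigma > 0; a production version would need a fallback here
    return [round((r - mu) / sigma, 4) for r in rewards]
```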
Example 1: rewards = [0.0, 1.0, 0.0, 1.0] → expected output: [-1.0, 1.0, -1.0, 1.0]

For this group of 4 outputs:
• Mean (μ) = (0.0 + 1.0 + 0.0 + 1.0) / 4 = 0.5
• Standard Deviation (σ) = √[((0-0.5)² + (1-0.5)² + (0-0.5)² + (1-0.5)²) / 4] = √0.25 = 0.5
Normalized advantages:
• Output 1: (0.0 - 0.5) / 0.5 = -1.0 (worse than average)
• Output 2: (1.0 - 0.5) / 0.5 = 1.0 (better than average)
• Output 3: (0.0 - 0.5) / 0.5 = -1.0 (worse than average)
• Output 4: (1.0 - 0.5) / 0.5 = 1.0 (better than average)
This clearly separates good responses (positive advantage) from poor ones (negative advantage).
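These figures can be reproduced directly with the standard library's population statistics (`statistics.fmean` and `statistics.pstdev`):

```python
from statistics import fmean, pstdev

rewards = [0.0, 1.0, 0.0, 1.0]
mu, sigma = fmean(rewards), pstdev(rewards)  # 0.5 and 0.5
advantages = [(r - mu) / sigma for r in rewards]
print(advantages)  # [-1.0, 1.0, -1.0, 1.0]
```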
Example 2: rewards = [0.2, 0.5, 0.8, 0.3, 0.7] → expected output: [-1.3156, 0.0, 1.3156, -0.8771, 0.8771]

For this group of 5 outputs:
• Mean (μ) = (0.2 + 0.5 + 0.8 + 0.3 + 0.7) / 5 = 0.5
• Variance = [(0.2-0.5)² + (0.5-0.5)² + (0.8-0.5)² + (0.3-0.5)² + (0.7-0.5)²] / 5 = 0.26 / 5 = 0.052
• Standard Deviation (σ) = √0.052 ≈ 0.2280
Normalized advantages:
• Output 1: (0.2 - 0.5) / 0.2280 ≈ -1.3156
• Output 2: (0.5 - 0.5) / 0.2280 = 0.0 (exactly at the mean)
• Output 3: (0.8 - 0.5) / 0.2280 ≈ 1.3156
• Output 4: (0.3 - 0.5) / 0.2280 ≈ -0.8771
• Output 5: (0.7 - 0.5) / 0.2280 ≈ 0.8771
Notice that the output with reward 0.5 (exactly the mean) gets advantage 0.0.
Example 3: rewards = [-0.5, 0.0, 0.5, 1.0] → expected output: [-1.3416, -0.4472, 0.4472, 1.3416]

This example includes negative rewards:
• Mean (μ) = (-0.5 + 0.0 + 0.5 + 1.0) / 4 = 0.25
• Standard Deviation (σ) = √[((-0.75)² + (-0.25)² + (0.25)² + (0.75)²) / 4] = √0.3125 ≈ 0.559
Normalized advantages:
• Output 1: (-0.5 - 0.25) / 0.559 ≈ -1.3416
• Output 2: (0.0 - 0.25) / 0.559 ≈ -0.4472
• Output 3: (0.5 - 0.25) / 0.559 ≈ 0.4472
• Output 4: (1.0 - 0.25) / 0.559 ≈ 1.3416
The advantages always sum to approximately zero (due to centering), and the transformation preserves the relative ordering of outputs.
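Both properties can be checked numerically; this sketch uses plain Python and the second example's rewards:

```python
rewards = [0.2, 0.5, 0.8, 0.3, 0.7]
mu = sum(rewards) / len(rewards)
sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mu) / sigma for r in rewards]

# Centering makes the advantages sum to (numerically) zero
assert abs(sum(advantages)) < 1e-9

# The affine transform preserves the relative ordering of outputs
order = lambda xs: sorted(range(len(xs)), key=xs.__getitem__)
assert order(rewards) == order(advantages)
```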
Constraints