In modern reinforcement learning systems designed for fine-tuning large language models, a crucial technique involves relative advantage estimation within generation groups. When optimizing a language model's policy, instead of using raw reward signals directly, we compute how each generated output compares to other outputs for the same input prompt.
This technique is fundamental to Group Relative Policy Optimization (GRPO), a training paradigm where the model generates multiple candidate outputs (a "group") for each prompt, and each output's advantage is computed relative to the group's performance. This relative scoring mechanism ensures that policy gradients push the model toward outputs that are better than the group average, rather than simply toward outputs with high absolute rewards.
The Normalization Process:
Given a group of G outputs with corresponding reward scores, the advantage for each output is computed using z-score normalization:
$$A_i = \frac{r_i - \mu}{\sigma}$$
Where:
• $A_i$ is the normalized advantage of output $i$
• $r_i$ is the reward of output $i$
• $\mu$ is the mean reward of the group
• $\sigma$ is the standard deviation of the group's rewards
This normalization transforms rewards into a distribution centered at zero, where outputs scoring above the group mean receive positive advantages and outputs scoring below it receive negative advantages.
Why This Matters:
Raw reward values can vary significantly across different prompts, making it difficult to combine gradients from different training examples. By normalizing within each group, we ensure that every prompt contributes equally to the training signal, regardless of the absolute reward scale. This prevents prompts with larger reward ranges from dominating the optimization process.
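The scale-equalizing effect can be sketched with two hypothetical groups whose rewards differ only by a factor of ten (the `z_norm` helper and group values below are illustrative, not part of the task):

```python
def z_norm(rs):
    """Z-score normalize a group of rewards (population statistics)."""
    mu = sum(rs) / len(rs)
    sigma = (sum((r - mu) ** 2 for r in rs) / len(rs)) ** 0.5
    return [(r - mu) / sigma for r in rs]

# Two prompts with very different raw reward scales
group_a = [1.0, 2.0, 3.0]     # small-scale rewards
group_b = [10.0, 20.0, 30.0]  # 10x larger scale

# After per-group normalization both groups yield the same advantages,
# so neither prompt dominates the gradient merely due to reward scale.
norm_a, norm_b = z_norm(group_a), z_norm(group_b)
```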
Your Task: Write a function that takes a list of reward values for a group of model outputs and returns the normalized advantage scores. Round each result to 4 decimal places for numerical precision.
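A minimal sketch of such a function, assuming the population standard deviation (dividing by G, as the worked examples below do) and that the group's rewards are not all identical (σ = 0 would divide by zero); the name `compute_advantages` is illustrative:

```python
def compute_advantages(rewards):
    """Z-score normalize a group's rewards into advantages, rounded to 4 dp."""
    g = len(rewards)
    mu = sum(rewards) / g
    # Population variance: divide by G, matching the worked examples
    sigma = (sum((r - mu) ** 2 for r in rewards) / g) ** 0.5
    # Assumes sigma > 0; a production version would need a fallback here
    return [round((r - mu) / sigma, 4) for r in rewards]
```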
Example 1: rewards = [0.0, 1.0, 0.0, 1.0] → expected output: [-1.0, 1.0, -1.0, 1.0]

For this group of 4 outputs:
• Mean (μ) = (0.0 + 1.0 + 0.0 + 1.0) / 4 = 0.5
• Standard Deviation (σ) = √[((0-0.5)² + (1-0.5)² + (0-0.5)² + (1-0.5)²) / 4] = √0.25 = 0.5
Normalized advantages:
• Output 1: (0.0 - 0.5) / 0.5 = -1.0 (worse than average)
• Output 2: (1.0 - 0.5) / 0.5 = 1.0 (better than average)
• Output 3: (0.0 - 0.5) / 0.5 = -1.0 (worse than average)
• Output 4: (1.0 - 0.5) / 0.5 = 1.0 (better than average)
This clearly separates good responses (positive advantage) from poor ones (negative advantage).
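These figures can be reproduced directly with the standard library's population statistics (`statistics.fmean` and `statistics.pstdev`):

```python
from statistics import fmean, pstdev

rewards = [0.0, 1.0, 0.0, 1.0]
mu, sigma = fmean(rewards), pstdev(rewards)  # 0.5 and 0.5
advantages = [(r - mu) / sigma for r in rewards]
print(advantages)  # [-1.0, 1.0, -1.0, 1.0]
```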
Example 2: rewards = [0.2, 0.5, 0.8, 0.3, 0.7] → expected output: [-1.3156, 0.0, 1.3156, -0.8771, 0.8771]

For this group of 5 outputs:
• Mean (μ) = (0.2 + 0.5 + 0.8 + 0.3 + 0.7) / 5 = 0.5
• Variance = [(0.2-0.5)² + (0.5-0.5)² + (0.8-0.5)² + (0.3-0.5)² + (0.7-0.5)²] / 5 = 0.26 / 5 = 0.052
• Standard Deviation (σ) = √0.052 ≈ 0.2280
Normalized advantages:
• Output 1: (0.2 - 0.5) / 0.2280 ≈ -1.3156
• Output 2: (0.5 - 0.5) / 0.2280 = 0.0 (exactly at the mean)
• Output 3: (0.8 - 0.5) / 0.2280 ≈ 1.3156
• Output 4: (0.3 - 0.5) / 0.2280 ≈ -0.8771
• Output 5: (0.7 - 0.5) / 0.2280 ≈ 0.8771
Notice that the output with reward 0.5 (exactly the mean) gets advantage 0.0.
Example 3: rewards = [-0.5, 0.0, 0.5, 1.0] → expected output: [-1.3416, -0.4472, 0.4472, 1.3416]

This example includes negative rewards:
• Mean (μ) = (-0.5 + 0.0 + 0.5 + 1.0) / 4 = 0.25
• Standard Deviation (σ) = √[((-0.75)² + (-0.25)² + (0.25)² + (0.75)²) / 4] = √0.3125 ≈ 0.559
Normalized advantages:
• Output 1: (-0.5 - 0.25) / 0.559 ≈ -1.3416
• Output 2: (0.0 - 0.25) / 0.559 ≈ -0.4472
• Output 3: (0.5 - 0.25) / 0.559 ≈ 0.4472
• Output 4: (1.0 - 0.25) / 0.559 ≈ 1.3416
The advantages always sum to approximately zero (due to centering), and the transformation preserves the relative ordering of outputs.
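Both properties can be checked numerically; this sketch uses plain Python and the second example's rewards:

```python
rewards = [0.2, 0.5, 0.8, 0.3, 0.7]
mu = sum(rewards) / len(rewards)
sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mu) / sigma for r in rewards]

# Centering makes the advantages sum to (numerically) zero
assert abs(sum(advantages)) < 1e-9

# The affine transform preserves the relative ordering of outputs
order = lambda xs: sorted(range(len(xs)), key=xs.__getitem__)
assert order(rewards) == order(advantages)
```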
Constraints