In reinforcement learning and policy optimization algorithms, it is crucial to prevent a learned policy from deviating too drastically from a stable reference distribution. This is achieved by incorporating a divergence penalty into the objective function, which regularizes the optimization process and ensures smooth, controlled policy updates.
The unbiased divergence estimator computes a per-sample measure of how different the current policy distribution (\pi_\theta) is from a reference policy distribution (\pi_{\text{ref}}). This measure is particularly important in algorithms like Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO), where staying close to the reference policy prevents catastrophic forgetting and maintains training stability.
Mathematical Formulation:
For each sample, the divergence estimator is computed using the following formula:
$$D_i = r_i - \log(r_i) - 1$$
where the probability ratio (r_i) is defined as:
$$r_i = \frac{\pi_{\text{ref}}(a_i)}{\pi_\theta(a_i)}$$
Here:
- (\pi_\theta(a_i)) is the probability the current policy assigns to action (a_i)
- (\pi_{\text{ref}}(a_i)) is the probability the reference policy assigns to the same action
- (r_i) is the resulting per-sample probability ratio
Intuition Behind the Formula:
This estimator has several elegant mathematical properties:
- It is always non-negative: (r - \log(r) - 1 \ge 0) for every (r > 0), with equality exactly when (r = 1), i.e. when the two policies agree on that sample.
- Its expectation under samples drawn from (\pi_\theta) equals the KL divergence (\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})): since (\mathbb{E}[r_i] = 1) and (\mathbb{E}[-\log r_i]) is exactly the KL divergence, the estimator is unbiased.
- Compared with the naive estimator (-\log r_i), it tends to have lower variance in practice while remaining unbiased, which is why it is used in GRPO-style objectives.
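The non-negativity and zero-at-one properties can be checked numerically; the sketch below evaluates the estimator at a few ratios (the helper name `k3` is an illustrative choice, not part of the problem):

```python
import numpy as np

def k3(r):
    # Per-sample divergence estimate: D = r - log(r) - 1
    return r - np.log(r) - 1.0

# Evaluate at ratios below, at, and above 1.0
ratios = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
vals = k3(ratios)

assert np.all(vals >= 0)        # non-negative for every r > 0
assert np.isclose(k3(1.0), 0.0) # exactly zero when the policies agree
```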
Your Task:
Implement a function that computes the per-sample policy divergence values. For each element in the input arrays, calculate the divergence using the formula above and return the results rounded to 4 decimal places.
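A minimal NumPy sketch of such a function follows; the name `compute_policy_divergence` and its signature are assumptions, since the problem does not fix them:

```python
import numpy as np

def compute_policy_divergence(pi_theta, pi_ref):
    """Per-sample divergence D_i = r_i - log(r_i) - 1, rounded to 4 decimals.

    (Function and argument names are illustrative; the problem statement
    does not prescribe a signature.)
    """
    pi_theta = np.asarray(pi_theta, dtype=float)
    pi_ref = np.asarray(pi_ref, dtype=float)
    r = pi_ref / pi_theta            # probability ratio r_i
    d = r - np.log(r) - 1.0          # divergence estimator
    return np.round(d, 4)

# First worked example from the problem:
# compute_policy_divergence([0.8], [0.4]) -> array([0.1931])
```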
Example 1:
Input:
pi_theta = np.array([0.8])
pi_ref = np.array([0.4])
Output: [0.1931]
Explanation:
Step 1: Compute the probability ratio:
r = pi_ref / pi_theta = 0.4 / 0.8 = 0.5
Step 2: Apply the divergence formula:
D = r - log(r) - 1
  = 0.5 - log(0.5) - 1
  = 0.5 - (-0.6931) - 1
  = 0.5 + 0.6931 - 1
  = 0.1931
Interpretation: The current policy assigns probability 0.8 to this action, which is higher than the reference policy's 0.4. The divergence of 0.1931 penalizes this overconfidence, encouraging the policy to stay closer to the reference distribution.
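The arithmetic above can be checked directly with plain Python, using nothing beyond the formula itself:

```python
import math

r = 0.4 / 0.8               # probability ratio
d = r - math.log(r) - 1     # divergence formula
print(round(d, 4))          # -> 0.1931
```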
Example 2:
Input:
pi_theta = np.array([0.5, 0.6, 0.7])
pi_ref = np.array([0.5, 0.3, 0.7])
Output: [0.0, 0.1931, 0.0]
Explanation:
Sample 1: pi_theta = 0.5, pi_ref = 0.5
r = 0.5 / 0.5 = 1.0
D = 1.0 - log(1.0) - 1 = 1.0 - 0 - 1 = 0.0
(No divergence when policies match exactly)
Sample 2: pi_theta = 0.6, pi_ref = 0.3
r = 0.3 / 0.6 = 0.5
D = 0.5 - log(0.5) - 1 = 0.5 + 0.6931 - 1 = 0.1931
(Penalty for assigning higher probability than reference)
Sample 3: pi_theta = 0.7, pi_ref = 0.7
r = 0.7 / 0.7 = 1.0
D = 1.0 - log(1.0) - 1 = 0.0
(No divergence when policies match exactly)
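The three per-sample calculations above can also be done in one vectorized step; this sketch applies the formula elementwise with NumPy:

```python
import numpy as np

pi_theta = np.array([0.5, 0.6, 0.7])
pi_ref = np.array([0.5, 0.3, 0.7])

r = pi_ref / pi_theta                # elementwise ratios: [1.0, 0.5, 1.0]
d = np.round(r - np.log(r) - 1, 4)   # -> [0.0, 0.1931, 0.0]
```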
Example 3:
Input:
pi_theta = np.array([0.4, 0.5, 0.6])
pi_ref = np.array([0.4, 0.5, 0.6])
Output: [0.0, 0.0, 0.0]
Explanation:
When the current policy exactly matches the reference policy for all samples, every probability ratio equals 1.0:
For all samples: r = pi_ref / pi_theta = 1.0, so D = 1.0 - log(1.0) - 1 = 1.0 - 0 - 1 = 0.0
Result: All divergence values are zero, indicating perfect alignment between the policies. This is the ideal scenario where no regularization penalty is incurred.
Constraints: