In reinforcement learning and policy optimization algorithms, it is crucial to prevent a learned policy from deviating too drastically from a stable reference distribution. This is achieved by incorporating a divergence penalty into the objective function, which regularizes the optimization process and ensures smooth, controlled policy updates.
The unbiased divergence estimator computes a per-sample measure of how different the current policy distribution (\pi_\theta) is from a reference policy distribution (\pi_{\text{ref}}). This measure is particularly important in algorithms like Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO), where staying close to the reference policy prevents catastrophic forgetting and maintains training stability.
Mathematical Formulation:
For each sample, the divergence estimator is computed using the following formula:
$$D_i = r_i - \log(r_i) - 1$$
where the probability ratio (r_i) is defined as:
$$r_i = \frac{\pi_{\text{ref}}(a_i)}{\pi_\theta(a_i)}$$
Here:
- (\pi_\theta(a_i)) is the probability the current policy assigns to action (a_i)
- (\pi_{\text{ref}}(a_i)) is the probability the reference policy assigns to the same action
- (r_i) is the resulting per-sample probability ratio
Intuition Behind the Formula:
This estimator has several elegant mathematical properties:
- It is always non-negative: (r - \log(r) - 1 \ge 0) for every (r > 0), with equality exactly when (r = 1), i.e. when the two policies agree on that sample.
- Its expectation under samples drawn from (\pi_\theta) equals the KL divergence (\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})): since (\mathbb{E}[r_i] = 1) and (\mathbb{E}[-\log r_i]) is exactly the KL divergence, the estimator is unbiased.
- Compared with the naive estimator (-\log r_i), it tends to have lower variance in practice while remaining unbiased, which is why it is used in GRPO-style objectives.
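The non-negativity and zero-at-one properties can be checked numerically; the sketch below evaluates the estimator at a few ratios (the helper name `k3` is an illustrative choice, not part of the problem):

```python
import numpy as np

def k3(r):
    # Per-sample divergence estimate: D = r - log(r) - 1
    return r - np.log(r) - 1.0

# Evaluate at ratios below, at, and above 1.0
ratios = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
vals = k3(ratios)

assert np.all(vals >= 0)        # non-negative for every r > 0
assert np.isclose(k3(1.0), 0.0) # exactly zero when the policies agree
```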
Your Task:
Implement a function that computes the per-sample policy divergence values. For each element in the input arrays, calculate the divergence using the formula above and return the results rounded to 4 decimal places.
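A minimal NumPy sketch of such a function follows; the name `compute_policy_divergence` and its signature are assumptions, since the problem does not fix them:

```python
import numpy as np

def compute_policy_divergence(pi_theta, pi_ref):
    """Per-sample divergence D_i = r_i - log(r_i) - 1, rounded to 4 decimals.

    (Function and argument names are illustrative; the problem statement
    does not prescribe a signature.)
    """
    pi_theta = np.asarray(pi_theta, dtype=float)
    pi_ref = np.asarray(pi_ref, dtype=float)
    r = pi_ref / pi_theta            # probability ratio r_i
    d = r - np.log(r) - 1.0          # divergence estimator
    return np.round(d, 4)

# First worked example from the problem:
# compute_policy_divergence([0.8], [0.4]) -> array([0.1931])
```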
Example 1:
Input:
pi_theta = np.array([0.8])
pi_ref = np.array([0.4])
Output: [0.1931]
Explanation:
Step 1: Compute the probability ratio:
r = pi_ref / pi_theta = 0.4 / 0.8 = 0.5
Step 2: Apply the divergence formula:
D = r - log(r) - 1
  = 0.5 - log(0.5) - 1
  = 0.5 - (-0.6931) - 1
  = 0.5 + 0.6931 - 1
  = 0.1931
Interpretation: The current policy assigns probability 0.8 to this action, which is higher than the reference policy's 0.4. The divergence of 0.1931 penalizes this overconfidence, encouraging the policy to stay closer to the reference distribution.
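The arithmetic above can be checked directly with plain Python, using nothing beyond the formula itself:

```python
import math

r = 0.4 / 0.8               # probability ratio
d = r - math.log(r) - 1     # divergence formula
print(round(d, 4))          # -> 0.1931
```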
Example 2:
Input:
pi_theta = np.array([0.5, 0.6, 0.7])
pi_ref = np.array([0.5, 0.3, 0.7])
Output: [0.0, 0.1931, 0.0]
Explanation:
Sample 1: pi_theta = 0.5, pi_ref = 0.5
r = 0.5 / 0.5 = 1.0
D = 1.0 - log(1.0) - 1 = 1.0 - 0 - 1 = 0.0
(No divergence when policies match exactly)
Sample 2: pi_theta = 0.6, pi_ref = 0.3
r = 0.3 / 0.6 = 0.5
D = 0.5 - log(0.5) - 1 = 0.5 + 0.6931 - 1 = 0.1931
(Penalty for assigning higher probability than reference)
Sample 3: pi_theta = 0.7, pi_ref = 0.7
r = 0.7 / 0.7 = 1.0
D = 1.0 - log(1.0) - 1 = 0.0
(No divergence when policies match exactly)
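The three per-sample calculations above can also be done in one vectorized step; this sketch applies the formula elementwise with NumPy:

```python
import numpy as np

pi_theta = np.array([0.5, 0.6, 0.7])
pi_ref = np.array([0.5, 0.3, 0.7])

r = pi_ref / pi_theta                # elementwise ratios: [1.0, 0.5, 1.0]
d = np.round(r - np.log(r) - 1, 4)   # -> [0.0, 0.1931, 0.0]
```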
Example 3:
Input:
pi_theta = np.array([0.4, 0.5, 0.6])
pi_ref = np.array([0.4, 0.5, 0.6])
Output: [0.0, 0.0, 0.0]
Explanation:
When the current policy exactly matches the reference policy for all samples, every probability ratio equals 1.0:
For all samples: r = pi_ref / pi_theta = 1.0, so D = 1.0 - log(1.0) - 1 = 1.0 - 0 - 1 = 0.0
Result: All divergence values are zero, indicating perfect alignment between the policies. This is the ideal scenario where no regularization penalty is incurred.
Constraints: