In modern reinforcement learning from human feedback (RLHF) and adversarial training paradigms, a critical component is the reward computation mechanism that guides both the policy (the model generating outputs) and the critic (the evaluator comparing outputs). This problem focuses on implementing such a reward system in an adversarial framework.
Consider a training scenario where a language model (the policy) generates responses, and these responses are compared against expert demonstrations by a relativistic critic. The critic's job is to determine which response is better: the policy's or the expert's. This creates an adversarial game:
Reward Mechanics:
Given the critic's prediction and the actual position of the expert answer, compute both reward signals:
Critic Reward: The critic receives a reward based on how accurately it identified the expert response.
Policy Reward: The policy receives a reward based on how effectively it fooled the critic.
Determining Correctness:
The expert_position parameter indicates where the expert answer is located (position 1 or 2). The critic can predict:
- 'expert': Believes the answer in position 1 is from the expert (correct when expert_position = 1)
- 'policy': Believes the answer in position 1 is from the policy, i.e., the expert is in position 2 (correct when expert_position = 2)
- 'tie': Believes both answers are equally good

Match Logic:
- critic_prediction = 'expert' and expert_position = 1: Critic is correct → (1.0, 0.0)
- critic_prediction = 'expert' and expert_position = 2: Critic is wrong → (0.0, 1.0)
- critic_prediction = 'policy' and expert_position = 1: Critic is wrong → (0.0, 1.0)
- critic_prediction = 'policy' and expert_position = 2: Critic is correct → (1.0, 0.0)
- critic_prediction = 'tie': Both receive partial rewards → (tau_critic, tau_policy)

Your Task:
Implement a function that computes the appropriate reward tuple (critic_reward, policy_reward) based on these rules.
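The rules above can be sketched as a single function. This is a minimal reference sketch, not the official solution; the function name `compute_rewards` and its signature are assumptions chosen to match the parameter names used in the problem statement.

```python
def compute_rewards(critic_prediction, expert_position, tau_critic, tau_policy):
    """Return (critic_reward, policy_reward) for one adversarial round.

    critic_prediction: one of 'expert', 'policy', 'tie'
    expert_position:   1 or 2, where the expert answer actually sits
    """
    if critic_prediction == 'tie':
        # A tie prediction gives both sides their configured partial reward.
        return (tau_critic, tau_policy)
    # 'expert' means the critic thinks the expert is in position 1;
    # 'policy' means it thinks the expert is in position 2.
    critic_correct = (critic_prediction == 'expert') == (expert_position == 1)
    # Zero-sum outcome: a correct critic gets 1.0, a fooled critic gives 1.0 to the policy.
    return (1.0, 0.0) if critic_correct else (0.0, 1.0)
```

Note the zero-sum structure in the non-tie cases: critic_reward + policy_reward is always 1.0, which is what makes the policy-vs-critic game adversarial.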
Examples:

Input: critic_prediction = 'expert', expert_position = 1, tau_critic = 0.5, tau_policy = 0.5
Output: [1.0, 0.0]
Explanation: The critic correctly identifies that the expert answer is in position 1. Since the prediction matches the ground truth, the critic receives the full reward of 1.0, while the policy receives 0.0 because it failed to deceive the critic.

Input: critic_prediction = 'policy', expert_position = 1, tau_critic = 0.5, tau_policy = 0.5
Output: [0.0, 1.0]
Explanation: The critic incorrectly believes the policy's answer (position 2) is better when the expert is actually in position 1. The critic receives 0.0 for the wrong judgment, while the policy receives the full reward of 1.0 for successfully fooling the critic.

Input: critic_prediction = 'tie', expert_position = 2, tau_critic = 0.5, tau_policy = 0.5
Output: [0.5, 0.5]
Explanation: When the critic predicts a tie, both the critic and policy receive their respective partial rewards. The critic gets tau_critic = 0.5 (acknowledging uncertainty is partially correct), and the policy gets tau_policy = 0.5 (achieving partial confusion of the critic).
Constraints: