In modern reinforcement learning from human feedback (RLHF) and adversarial training paradigms, a critical component is the reward computation mechanism that guides both the policy (the model generating outputs) and the critic (the evaluator comparing outputs). This problem focuses on implementing such a reward system in an adversarial framework.
Consider a training scenario where a language model (the policy) generates responses, and these responses are compared against expert demonstrations by a relativistic critic. The critic's job is to determine which response is better: the policy's or the expert's. This creates an adversarial game:
Reward Mechanics:
Given the critic's prediction and the actual position of the expert answer, compute both reward signals:
Critic Reward: The critic receives a reward based on how accurately it identified the expert response.
Policy Reward: The policy receives a reward based on how effectively it fooled the critic.
Determining Correctness:
The expert_position parameter indicates where the expert answer is located (position 1 or 2). The critic can predict:
- 'expert': Believes the answer in position 1 is from the expert (correct when expert_position = 1)
- 'policy': Believes the answer in position 1 is from the policy, i.e., the expert is in position 2 (correct when expert_position = 2)
- 'tie': Believes both answers are equally good

Match Logic:
- critic_prediction = 'expert' and expert_position = 1: Critic is correct → (1.0, 0.0)
- critic_prediction = 'expert' and expert_position = 2: Critic is wrong → (0.0, 1.0)
- critic_prediction = 'policy' and expert_position = 1: Critic is wrong → (0.0, 1.0)
- critic_prediction = 'policy' and expert_position = 2: Critic is correct → (1.0, 0.0)
- critic_prediction = 'tie': Both receive partial rewards → (tau_critic, tau_policy)

Your Task:
Implement a function that computes the appropriate reward tuple (critic_reward, policy_reward) based on these rules.
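The rules above can be sketched as a single function. This is a minimal reference sketch, not the official solution; the function name `compute_rewards` and its signature are assumptions chosen to match the parameter names used in the problem statement.

```python
def compute_rewards(critic_prediction, expert_position, tau_critic, tau_policy):
    """Return (critic_reward, policy_reward) for one adversarial round.

    critic_prediction: one of 'expert', 'policy', 'tie'
    expert_position:   1 or 2, where the expert answer actually sits
    """
    if critic_prediction == 'tie':
        # A tie prediction gives both sides their configured partial reward.
        return (tau_critic, tau_policy)
    # 'expert' means the critic thinks the expert is in position 1;
    # 'policy' means it thinks the expert is in position 2.
    critic_correct = (critic_prediction == 'expert') == (expert_position == 1)
    # Zero-sum outcome: a correct critic gets 1.0, a fooled critic gives 1.0 to the policy.
    return (1.0, 0.0) if critic_correct else (0.0, 1.0)
```

Note the zero-sum structure in the non-tie cases: critic_reward + policy_reward is always 1.0, which is what makes the policy-vs-critic game adversarial.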
Examples:

Input: critic_prediction = 'expert', expert_position = 1, tau_critic = 0.5, tau_policy = 0.5
Output: [1.0, 0.0]
Explanation: The critic correctly identifies that the expert answer is in position 1. Since the prediction matches the ground truth, the critic receives the full reward of 1.0, while the policy receives 0.0 because it failed to deceive the critic.

Input: critic_prediction = 'policy', expert_position = 1, tau_critic = 0.5, tau_policy = 0.5
Output: [0.0, 1.0]
Explanation: The critic incorrectly believes the policy's answer (position 2) is better when the expert is actually in position 1. The critic receives 0.0 for the wrong judgment, while the policy receives the full reward of 1.0 for successfully fooling the critic.

Input: critic_prediction = 'tie', expert_position = 2, tau_critic = 0.5, tau_policy = 0.5
Output: [0.5, 0.5]
Explanation: When the critic predicts a tie, both the critic and policy receive their respective partial rewards. The critic gets tau_critic = 0.5 (acknowledging uncertainty is partially correct), and the policy gets tau_policy = 0.5 (achieving partial confusion of the critic).
Constraints: