In sequential decision-making problems, agents must balance exploitation (choosing actions with known high rewards) against exploration (trying less-certain actions that might yield better outcomes). This fundamental challenge, known as the exploration-exploitation tradeoff, is central to reinforcement learning and multi-armed bandit problems.
One elegant solution to this dilemma is to assign each action an uncertainty bonus that accounts for how many times it has been tried. Actions selected fewer times have higher uncertainty, making them more attractive for exploration. The Optimistic Action Value for each action combines the observed average reward with this exploration bonus:
$$Q_{\text{optimistic}}(a) = \bar{R}(a) + c \cdot \sqrt{\frac{\ln(t)}{N(a)}}$$
Where:
• $\bar{R}(a)$ is the average reward observed so far for action $a$
• $N(a)$ is the number of times action $a$ has been selected
• $t$ is the current timestep
• $c$ is the exploration coefficient that scales the bonus
The agent selects the action with the highest optimistic value. This strategy ensures that:
• Actions that have been tried rarely receive a large bonus and are eventually explored
• As $N(a)$ grows, the bonus shrinks, so the agent increasingly exploits actions with high average rewards
Your Task: Write a Python function that implements this action selection strategy. Given the selection counts for each action, their average rewards, the current timestep, and the exploration coefficient, return the index of the action that should be selected (the one with the highest optimistic value).
Note: If multiple actions have the same highest optimistic value, return the action with the smallest index.
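One way to structure a solution is a direct translation of the formula: compute the optimistic value for every action, then take the index of the maximum. This is a sketch, not the official reference solution; the function name `select_action` is chosen for illustration.

```python
import math

def select_action(selection_counts, average_rewards, time_step, exploration_coefficient):
    """Return the index of the action with the highest optimistic value.

    optimistic(a) = average_rewards[a] + c * sqrt(ln(t) / N(a))
    """
    optimistic_values = [
        avg + exploration_coefficient * math.sqrt(math.log(time_step) / count)
        for avg, count in zip(average_rewards, selection_counts)
    ]
    # list.index(max(...)) returns the FIRST occurrence of the maximum,
    # which satisfies the smallest-index tie-breaking rule above.
    return optimistic_values.index(max(optimistic_values))
```

Note that `list.index` on the maximum value handles ties for free, because it always returns the first matching position.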
Example 1:
selection_counts = [1, 1, 1, 1]
average_rewards = [1.0, 2.0, 1.5, 0.5]
time_step = 4
exploration_coefficient = 2.0

With all four actions selected exactly once, they all have the same uncertainty bonus. The exploration bonus for each action is:
c × √(ln(4) / 1) = 2.0 × √1.386 ≈ 2.35
Therefore, the optimistic values are:
• Action 0: 1.0 + 2.35 = 3.35
• Action 1: 2.0 + 2.35 = 4.35
• Action 2: 1.5 + 2.35 = 3.85
• Action 3: 0.5 + 2.35 = 2.85
Action 1 has the highest optimistic value (4.35), so it is selected.
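The arithmetic above can be verified in a few lines (variable names here are illustrative):

```python
import math

c, t = 2.0, 4
# Every action has N(a) = 1, so all bonuses are equal: 2.0 * sqrt(ln 4) ≈ 2.35
bonus = c * math.sqrt(math.log(t) / 1)
values = [r + bonus for r in [1.0, 2.0, 1.5, 0.5]]
best = values.index(max(values))  # action 1, the highest average reward
```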
Example 2:
selection_counts = [5, 5, 5]
average_rewards = [1.0, 10.0, 5.0]
time_step = 15
exploration_coefficient = 1.0

All three actions have been selected 5 times each, so they share the same exploration bonus:
c × √(ln(15) / 5) = 1.0 × √(2.708 / 5) = 1.0 × √0.5416 ≈ 0.736
The optimistic values become:
• Action 0: 1.0 + 0.736 = 1.736
• Action 1: 10.0 + 0.736 = 10.736
• Action 2: 5.0 + 0.736 = 5.736
Action 1, with the highest average reward, clearly dominates with an optimistic value of 10.736.
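Because the counts are equal, the bonus cancels out of the comparison and the selection reduces to picking the best average reward, as a quick check confirms:

```python
import math

# Shared bonus: 1.0 * sqrt(ln(15) / 5) ≈ 0.736
bonus = 1.0 * math.sqrt(math.log(15) / 5)
values = [r + bonus for r in [1.0, 10.0, 5.0]]
best = values.index(max(values))  # action 1
```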
Example 3:
selection_counts = [10, 1, 10]
average_rewards = [5.0, 4.0, 5.0]
time_step = 21
exploration_coefficient = 2.0

Here, action 1 has only been tried once while actions 0 and 2 have been tried 10 times each. Despite action 1 having a lower average reward (4.0 vs 5.0), its uncertainty bonus is much larger:
• Bonus for actions 0 and 2: 2.0 × √(ln(21) / 10) = 2.0 × √0.3045 ≈ 1.10
• Bonus for action 1: 2.0 × √(ln(21) / 1) = 2.0 × √3.045 ≈ 3.49
Optimistic values:
• Action 0: 5.0 + 1.10 = 6.10
• Action 1: 4.0 + 3.49 = 7.49
• Action 2: 5.0 + 1.10 = 6.10
The under-explored action 1 is chosen because its high uncertainty bonus compensates for its lower average reward.
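This exploration-driven case is the one worth testing most carefully, since it is where a naive greedy rule would pick the wrong action. A quick numerical check of the walkthrough:

```python
import math

c, t = 2.0, 21
counts = [10, 1, 10]
rewards = [5.0, 4.0, 5.0]
# Per-action bonuses: ≈ 1.10 for the well-explored actions, ≈ 3.49 for action 1
bonuses = [c * math.sqrt(math.log(t) / n) for n in counts]
values = [r + b for r, b in zip(rewards, bonuses)]
best = values.index(max(values))  # action 1, despite its lower average reward
```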
Constraints