In sequential decision-making problems, agents must balance exploitation (choosing actions with known high rewards) against exploration (trying less-certain actions that might yield better outcomes). This fundamental challenge, known as the exploration-exploitation tradeoff, is central to reinforcement learning and multi-armed bandit problems.
One elegant solution to this dilemma is to assign each action an uncertainty bonus that accounts for how many times it has been tried. Actions selected fewer times have higher uncertainty, making them more attractive for exploration. The Optimistic Action Value for each action combines the observed average reward with this exploration bonus:
$$Q_{\text{optimistic}}(a) = \bar{R}(a) + c \cdot \sqrt{\frac{\ln(t)}{N(a)}}$$
Where:
• $\bar{R}(a)$ is the average reward observed so far for action $a$
• $N(a)$ is the number of times action $a$ has been selected
• $t$ is the current timestep
• $c$ is the exploration coefficient that scales the bonus
The agent selects the action with the highest optimistic value. This strategy ensures that:
• Actions that have been tried rarely receive a large bonus and are eventually explored
• As $N(a)$ grows, the bonus shrinks, so the agent increasingly exploits actions with high average rewards
Your Task: Write a Python function that implements this action selection strategy. Given the selection counts for each action, their average rewards, the current timestep, and the exploration coefficient, return the index of the action that should be selected (the one with the highest optimistic value).
Note: If multiple actions have the same highest optimistic value, return the action with the smallest index.
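One way to structure a solution is a direct translation of the formula: compute the optimistic value for every action, then take the index of the maximum. This is a sketch, not the official reference solution; the function name `select_action` is chosen for illustration.

```python
import math

def select_action(selection_counts, average_rewards, time_step, exploration_coefficient):
    """Return the index of the action with the highest optimistic value.

    optimistic(a) = average_rewards[a] + c * sqrt(ln(t) / N(a))
    """
    optimistic_values = [
        avg + exploration_coefficient * math.sqrt(math.log(time_step) / count)
        for avg, count in zip(average_rewards, selection_counts)
    ]
    # list.index(max(...)) returns the FIRST occurrence of the maximum,
    # which satisfies the smallest-index tie-breaking rule above.
    return optimistic_values.index(max(optimistic_values))
```

Note that `list.index` on the maximum value handles ties for free, because it always returns the first matching position.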
Example 1:
selection_counts = [1, 1, 1, 1]
average_rewards = [1.0, 2.0, 1.5, 0.5]
time_step = 4
exploration_coefficient = 2.0

With all four actions selected exactly once, they all have the same uncertainty bonus. The exploration bonus for each action is:
c × √(ln(4) / 1) = 2.0 × √1.386 ≈ 2.35
Therefore, the optimistic values are:
• Action 0: 1.0 + 2.35 = 3.35
• Action 1: 2.0 + 2.35 = 4.35
• Action 2: 1.5 + 2.35 = 3.85
• Action 3: 0.5 + 2.35 = 2.85
Action 1 has the highest optimistic value (4.35), so it is selected.
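The arithmetic above can be verified in a few lines (variable names here are illustrative):

```python
import math

c, t = 2.0, 4
# Every action has N(a) = 1, so all bonuses are equal: 2.0 * sqrt(ln 4) ≈ 2.35
bonus = c * math.sqrt(math.log(t) / 1)
values = [r + bonus for r in [1.0, 2.0, 1.5, 0.5]]
best = values.index(max(values))  # action 1, the highest average reward
```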
Example 2:
selection_counts = [5, 5, 5]
average_rewards = [1.0, 10.0, 5.0]
time_step = 15
exploration_coefficient = 1.0

All three actions have been selected 5 times each, so they share the same exploration bonus:
c × √(ln(15) / 5) = 1.0 × √(2.708 / 5) = 1.0 × √0.5416 ≈ 0.736
The optimistic values become:
• Action 0: 1.0 + 0.736 = 1.736
• Action 1: 10.0 + 0.736 = 10.736
• Action 2: 5.0 + 0.736 = 5.736
Action 1, with the highest average reward, clearly dominates with an optimistic value of 10.736.
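Because the counts are equal, the bonus cancels out of the comparison and the selection reduces to picking the best average reward, as a quick check confirms:

```python
import math

# Shared bonus: 1.0 * sqrt(ln(15) / 5) ≈ 0.736
bonus = 1.0 * math.sqrt(math.log(15) / 5)
values = [r + bonus for r in [1.0, 10.0, 5.0]]
best = values.index(max(values))  # action 1
```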
Example 3:
selection_counts = [10, 1, 10]
average_rewards = [5.0, 4.0, 5.0]
time_step = 21
exploration_coefficient = 2.0

Here, action 1 has only been tried once while actions 0 and 2 have been tried 10 times each. Despite action 1 having a lower average reward (4.0 vs 5.0), its uncertainty bonus is much larger:
• Bonus for actions 0 and 2: 2.0 × √(ln(21) / 10) = 2.0 × √0.3045 ≈ 1.10
• Bonus for action 1: 2.0 × √(ln(21) / 1) = 2.0 × √3.045 ≈ 3.49
Optimistic values:
• Action 0: 5.0 + 1.10 = 6.10
• Action 1: 4.0 + 3.49 = 7.49
• Action 2: 5.0 + 1.10 = 6.10
The under-explored action 1 is chosen because its high uncertainty bonus compensates for its lower average reward.
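This exploration-driven case is the one worth testing most carefully, since it is where a naive greedy rule would pick the wrong action. A quick numerical check of the walkthrough:

```python
import math

c, t = 2.0, 21
counts = [10, 1, 10]
rewards = [5.0, 4.0, 5.0]
# Per-action bonuses: ≈ 1.10 for the well-explored actions, ≈ 3.49 for action 1
bonuses = [c * math.sqrt(math.log(t) / n) for n in counts]
values = [r + b for r, b in zip(rewards, bonuses)]
best = values.index(max(values))  # action 1, despite its lower average reward
```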
Constraints