One of the most fundamental challenges in reinforcement learning and decision-making under uncertainty is the exploration-exploitation dilemma. An agent operating in an unknown environment must balance two competing objectives:
• Exploration: trying actions with uncertain outcomes to gather information about their rewards.
• Exploitation: choosing the action currently believed to yield the highest reward.
This trade-off is elegantly captured in the multi-armed bandit problem—a classic framework where an agent repeatedly chooses among k different actions (analogous to pulling different slot machine levers), each with an unknown probability distribution of rewards. The goal is to maximize the cumulative reward over time.
A widely-used approach to address this dilemma is the stochastic action selection policy with an exploration parameter ε (epsilon), which operates as follows:
$$ \text{action} = \begin{cases} \text{random action from } \{0, 1, \ldots, k-1\} & \text{with probability } \varepsilon \\ \underset{a}{\arg\max}\, Q(a) & \text{with probability } 1 - \varepsilon \end{cases} $$
Where:
• ε ∈ [0, 1] is the exploration rate,
• Q(a) is the current value estimate for action a,
• k is the number of available actions.
Implement a function that, given a list of value estimates for each action and an exploration rate ε, returns the index of the selected action according to this probabilistic strategy.
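One way to implement this selection rule in Python is sketched below. The function name `select_action` is a placeholder, not part of the problem statement:

```python
import random

def select_action(value_estimates, exploration_rate):
    """Epsilon-greedy action selection.

    With probability `exploration_rate`, pick a uniformly random action
    index; otherwise pick the index of the highest value estimate.
    """
    if random.random() < exploration_rate:
        # Explore: uniform random choice over all action indices.
        return random.randrange(len(value_estimates))
    # Exploit: index of the maximum estimate (ties resolve to the lowest index).
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])
```

Note that `random.random() < exploration_rate` is never true when ε = 0 and always true when ε = 1, so the two edge cases fall out of the comparison without special handling.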
Key Properties:
• ε = 0: pure exploitation — the greedy action is always selected.
• ε = 1: pure exploration — every action is selected uniformly at random.
• Intermediate ε blends the two: the greedy action is chosen with probability 1 − ε + ε/k, and each other action with probability ε/k.
Example 1:
value_estimates = [0.5, 2.3, 1.7]
exploration_rate = 0.0
Output: 1

With exploration_rate = 0.0, the agent always exploits by selecting the action with the highest estimated value. The action values are:
• Action 0: 0.5
• Action 1: 2.3 (highest)
• Action 2: 1.7
Since we are in pure exploitation mode (ε = 0), the function deterministically returns index 1, corresponding to the maximum value 2.3.
Example 2:
value_estimates = [1.0, 3.0, 2.0, 4.0]
exploration_rate = 0.5
Output: 0, 1, 2, or 3 (stochastic)

With exploration_rate = 0.5, the agent has a 50% chance of exploring (choosing randomly among all 4 actions) and a 50% chance of exploiting (choosing action 3 with value 4.0).
• Action 0: 1.0
• Action 1: 3.0
• Action 2: 2.0
• Action 3: 4.0 (highest)
Possible outcomes:
• With probability 1 − ε = 0.5, the agent exploits and returns index 3.
• With probability ε = 0.5, the agent explores and returns a uniformly random index from {0, 1, 2, 3}.
Overall, action 3 is selected with probability 0.5 + 0.5/4 = 0.625, and each other action with probability 0.5/4 = 0.125.
Any action index in the valid range [0, 1, 2, 3] is a valid output for this stochastic scenario.
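These probabilities can be checked empirically by sampling many times. The sketch below re-defines a minimal epsilon-greedy helper (a hypothetical name, matching the rule described above) and counts how often each action is selected:

```python
import random

def select_action(value_estimates, exploration_rate):
    # Epsilon-greedy rule, restated here so the snippet is self-contained.
    if random.random() < exploration_rate:
        return random.randrange(len(value_estimates))
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

values = [1.0, 3.0, 2.0, 4.0]
trials = 100_000
counts = [0] * len(values)
for _ in range(trials):
    counts[select_action(values, 0.5)] += 1

# Greedy action 3 should appear with frequency close to
# (1 - 0.5) + 0.5/4 = 0.625; each other action close to 0.5/4 = 0.125.
print([c / trials for c in counts])
```

With 100,000 trials, the sampling noise on each frequency is on the order of a few thousandths, so the empirical frequencies land close to the analytical values.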
Example 3:
value_estimates = [1.5, 1.5, 1.5]
exploration_rate = 1.0
Output: 0, 1, or 2 (stochastic)

With exploration_rate = 1.0, the agent always explores by choosing a random action, regardless of estimated values. Since all three actions have equal value estimates (1.5), the greedy choice would be arbitrary anyway.
In pure exploration mode, any of the valid action indices (0, 1, or 2) should have an equal probability of being returned. The output is non-deterministic and any valid index is acceptable.
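The uniformity claim can be verified the same way: with ε = 1.0 every call takes the exploration branch, so each index should appear about a third of the time. Again, `select_action` is a hypothetical helper implementing the rule above:

```python
import random
from collections import Counter

def select_action(value_estimates, exploration_rate):
    # Epsilon-greedy rule, restated here so the snippet is self-contained.
    if random.random() < exploration_rate:
        return random.randrange(len(value_estimates))
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

trials = 60_000
counts = Counter(select_action([1.5, 1.5, 1.5], 1.0) for _ in range(trials))

# Each of the three indices should occur with frequency close to 1/3.
print({a: counts[a] / trials for a in sorted(counts)})
```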
Constraints