One of the most fundamental challenges in reinforcement learning and decision-making under uncertainty is the exploration-exploitation dilemma. An agent operating in an unknown environment must balance two competing objectives:
• Exploration: trying actions with uncertain outcomes to gather information about their rewards.
• Exploitation: choosing the action currently believed to yield the highest reward.
This trade-off is elegantly captured in the multi-armed bandit problem—a classic framework where an agent repeatedly chooses among k different actions (analogous to pulling different slot machine levers), each with an unknown probability distribution of rewards. The goal is to maximize the cumulative reward over time.
A widely-used approach to address this dilemma is the stochastic action selection policy with an exploration parameter ε (epsilon), which operates as follows:
$$ \text{action} = \begin{cases} \text{random action from } \{0, 1, \ldots, k-1\} & \text{with probability } \varepsilon \\ \underset{a}{\arg\max}\, Q(a) & \text{with probability } 1 - \varepsilon \end{cases} $$
Where:
• ε ∈ [0, 1] is the exploration rate,
• Q(a) is the current value estimate for action a,
• k is the number of available actions.
Implement a function that, given a list of value estimates for each action and an exploration rate ε, returns the index of the selected action according to this probabilistic strategy.
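One way to implement this selection rule in Python is sketched below. The function name `select_action` is a placeholder, not part of the problem statement:

```python
import random

def select_action(value_estimates, exploration_rate):
    """Epsilon-greedy action selection.

    With probability `exploration_rate`, pick a uniformly random action
    index; otherwise pick the index of the highest value estimate.
    """
    if random.random() < exploration_rate:
        # Explore: uniform random choice over all action indices.
        return random.randrange(len(value_estimates))
    # Exploit: index of the maximum estimate (ties resolve to the lowest index).
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])
```

Note that `random.random() < exploration_rate` is never true when ε = 0 and always true when ε = 1, so the two edge cases fall out of the comparison without special handling.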
Key Properties:
• ε = 0: pure exploitation — the greedy action is always selected.
• ε = 1: pure exploration — every action is selected uniformly at random.
• Intermediate ε blends the two: the greedy action is chosen with probability 1 − ε + ε/k, and each other action with probability ε/k.
Example 1:
value_estimates = [0.5, 2.3, 1.7]
exploration_rate = 0.0
Output: 1

With exploration_rate = 0.0, the agent always exploits by selecting the action with the highest estimated value. The action values are:
• Action 0: 0.5
• Action 1: 2.3 (highest)
• Action 2: 1.7
Since we are in pure exploitation mode (ε = 0), the function deterministically returns index 1, corresponding to the maximum value 2.3.
Example 2:
value_estimates = [1.0, 3.0, 2.0, 4.0]
exploration_rate = 0.5
Output: 0, 1, 2, or 3 (stochastic)

With exploration_rate = 0.5, the agent has a 50% chance of exploring (choosing randomly among all 4 actions) and a 50% chance of exploiting (choosing action 3 with value 4.0).
• Action 0: 1.0
• Action 1: 3.0
• Action 2: 2.0
• Action 3: 4.0 (highest)
Possible outcomes:
• With probability 1 − ε = 0.5, the agent exploits and returns index 3.
• With probability ε = 0.5, the agent explores and returns a uniformly random index from {0, 1, 2, 3}.
Overall, action 3 is selected with probability 0.5 + 0.5/4 = 0.625, and each other action with probability 0.5/4 = 0.125.
Any action index in the valid range [0, 1, 2, 3] is a valid output for this stochastic scenario.
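These probabilities can be checked empirically by sampling many times. The sketch below re-defines a minimal epsilon-greedy helper (a hypothetical name, matching the rule described above) and counts how often each action is selected:

```python
import random

def select_action(value_estimates, exploration_rate):
    # Epsilon-greedy rule, restated here so the snippet is self-contained.
    if random.random() < exploration_rate:
        return random.randrange(len(value_estimates))
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

values = [1.0, 3.0, 2.0, 4.0]
trials = 100_000
counts = [0] * len(values)
for _ in range(trials):
    counts[select_action(values, 0.5)] += 1

# Greedy action 3 should appear with frequency close to
# (1 - 0.5) + 0.5/4 = 0.625; each other action close to 0.5/4 = 0.125.
print([c / trials for c in counts])
```

With 100,000 trials, the sampling noise on each frequency is on the order of a few thousandths, so the empirical frequencies land close to the analytical values.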
Example 3:
value_estimates = [1.5, 1.5, 1.5]
exploration_rate = 1.0
Output: 0, 1, or 2 (stochastic)

With exploration_rate = 1.0, the agent always explores by choosing a random action, regardless of estimated values. Since all three actions have equal value estimates (1.5), the greedy choice would be arbitrary anyway.
In pure exploration mode, any of the valid action indices (0, 1, or 2) should have an equal probability of being returned. The output is non-deterministic and any valid index is acceptable.
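The uniformity claim can be verified the same way: with ε = 1.0 every call takes the exploration branch, so each index should appear about a third of the time. Again, `select_action` is a hypothetical helper implementing the rule above:

```python
import random
from collections import Counter

def select_action(value_estimates, exploration_rate):
    # Epsilon-greedy rule, restated here so the snippet is self-contained.
    if random.random() < exploration_rate:
        return random.randrange(len(value_estimates))
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

trials = 60_000
counts = Counter(select_action([1.5, 1.5, 1.5], 1.0) for _ in range(trials))

# Each of the three indices should occur with frequency close to 1/3.
print({a: counts[a] / trials for a in sorted(counts)})
```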
Constraints