In reinforcement learning and decision-making systems, a fundamental challenge is learning which actions to take when operating in uncertain environments with limited feedback. The preference-based action selection paradigm addresses this by maintaining numerical preference values for each available action and converting these preferences into selection probabilities using the softmax function.
Unlike simpler selection strategies that directly estimate action values, preference-based methods learn a relative preference for each action. These preferences are then transformed into a probability distribution using the softmax function:
$$\pi(a) = \frac{e^{H(a)}}{\sum_{b=1}^{k} e^{H(b)}}$$
Where:
• H(a) is the numerical preference for action a
• k is the number of available actions
• π(a) is the resulting probability of selecting action a
The softmax function ensures that actions with higher preferences are selected more frequently, while still maintaining exploration by never assigning zero probability to any action.
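As a sketch, the softmax transformation above might be computed as follows (subtracting the maximum preference before exponentiating is a standard numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(preferences):
    """Convert a list of preference values H(a) into selection probabilities."""
    # Subtract the max preference for numerical stability; the result is unchanged.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]
```

With all preferences at zero, `softmax([0.0, 0.0, 0.0])` returns a uniform distribution of 1/3 per action, and every action always receives strictly positive probability.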
After taking an action and receiving a reward, the preferences are updated using a gradient-based learning rule. This update mechanism increases the preference for actions that yield rewards better than the baseline (average reward) and decreases preferences for actions that underperform:
$$H_{t+1}(A_t) = H_t(A_t) + \alpha \cdot (R_t - \bar{R}_t) \cdot (1 - \pi_t(A_t))$$
$$H_{t+1}(a) = H_t(a) - \alpha \cdot (R_t - \bar{R}_t) \cdot \pi_t(a) \quad \text{for all } a \neq A_t$$
Where:
• H_t(a) is the preference for action a at time step t
• α is the step-size (learning rate) parameter
• R_t is the reward received at time step t
• R̄_t is the baseline (the running average of all rewards received so far)
• π_t(a) is the softmax probability of selecting action a at time step t
• A_t is the action selected at time step t
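A direct translation of the two update equations, as a hedged sketch (the function and argument names here are illustrative, not part of the problem statement):

```python
def update_preferences(H, probs, chosen, reward, baseline, alpha):
    """Apply the gradient update to the preference list H in place.

    H        : preference values H_t(a)
    probs    : softmax probabilities pi_t(a), computed before this update
    chosen   : index of the selected action A_t
    reward   : observed reward R_t
    baseline : running average reward (R-bar)_t
    alpha    : step-size parameter
    """
    advantage = reward - baseline
    for a in range(len(H)):
        if a == chosen:
            # Chosen action: move preference up when the advantage is positive.
            H[a] += alpha * advantage * (1 - probs[a])
        else:
            # All other actions: move their preferences the opposite way.
            H[a] -= alpha * advantage * probs[a]
```

Note that the update is zero for every action whenever the reward equals the baseline, since the advantage factor (R_t − R̄_t) multiplies both equations.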
Implement a PreferenceActionSelector class that:
• Initializes all action preferences to zero
• Computes softmax selection probabilities from the current preferences
• Updates the preferences after each (action, reward) observation using the gradient rule above
• Maintains a running average of all observed rewards as the baseline
The class should track a running baseline as the incremental mean of all rewards received, updating it with each new reward observation.
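The running baseline can be maintained without storing past rewards via the standard incremental-mean update, R̄_n = R̄_{n−1} + (R_n − R̄_{n−1})/n. A small sketch (the class and attribute names are illustrative):

```python
class RunningMean:
    """Incremental mean of a reward stream, updated one observation at a time."""

    def __init__(self):
        self.count = 0     # number of rewards seen so far
        self.mean = 0.0    # current running average

    def update(self, reward):
        # Equivalent to recomputing the mean of all rewards seen so far.
        self.count += 1
        self.mean += (reward - self.mean) / self.count
        return self.mean
```

After the first observation the mean equals that reward exactly, which is why the first update in the examples below produces a zero advantage.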
Example 1
Input: num_actions = 3, alpha = 0.1, actions_rewards = [[1, 1.0]]
Output: [0.33, 0.33, 0.34]
Explanation: Starting with 3 actions and all preferences at 0, the initial softmax gives equal probabilities of 0.333... for each action. After action 1 receives reward 1.0:
• The baseline is updated: R̄ = 1.0 (first reward)
• Since the reward equals the baseline, the advantage (R − R̄) = 0
• No preference update occurs, as the gradient term is zero when the advantage is zero
• The softmax probabilities remain approximately [0.33, 0.33, 0.34] (minor rounding differences)
Example 2
Input: num_actions = 4, alpha = 0.1, actions_rewards = [[0, 2.0], [1, -1.0], [2, 3.0], [0, 1.5]]
Output: [0.25, 0.21, 0.29, 0.25]
Explanation: Starting with 4 actions and equal preferences, each (action, reward) pair first updates the baseline and then shifts the preferences by the gradient rule. The accumulated updates result in action 2 having the highest probability (0.29), action 1 the lowest (0.21), and actions 0 and 3 near the uniform baseline (0.25).
Example 3
Input: num_actions = 2, alpha = 0.2, actions_rewards = [[0, 1.0], [1, 0.5], [0, 1.5], [0, 2.0]]
Output: [0.57, 0.43]
Explanation: With a higher learning rate (α = 0.2), preference updates are more aggressive. After these updates, action 0's preference has grown relative to action 1's, resulting in softmax probabilities of approximately [0.57, 0.43]. The higher learning rate causes faster convergence toward favoring the consistently higher-rewarding action.
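Putting the pieces together, here is a minimal sketch of the class (the method names `probabilities` and `update` are assumptions; the required interface may differ). Under the convention that the baseline is updated before computing the advantage, replaying the examples above reproduces their outputs:

```python
import math

class PreferenceActionSelector:
    """Gradient-based preference learning with softmax action selection."""

    def __init__(self, num_actions, alpha):
        self.H = [0.0] * num_actions   # action preferences, all start at 0
        self.alpha = alpha             # step-size parameter
        self.count = 0                 # number of rewards observed
        self.baseline = 0.0            # incremental mean of all rewards

    def probabilities(self):
        """Softmax over the current preferences (max-shifted for stability)."""
        m = max(self.H)
        exps = [math.exp(h - m) for h in self.H]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, action, reward):
        """Observe (action, reward) and apply the gradient preference update."""
        # Update the baseline first, so the very first reward yields zero advantage.
        self.count += 1
        self.baseline += (reward - self.baseline) / self.count
        advantage = reward - self.baseline
        probs = self.probabilities()
        for a in range(len(self.H)):
            if a == action:
                self.H[a] += self.alpha * advantage * (1 - probs[a])
            else:
                self.H[a] -= self.alpha * advantage * probs[a]
```

Feeding Example 3's sequence through this sketch ends with probabilities that round to [0.57, 0.43], and Example 2's sequence rounds to [0.25, 0.21, 0.29, 0.25], matching the expected outputs above.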
Constraints