In reinforcement learning and decision-making systems, a fundamental challenge is learning which actions to take when operating in uncertain environments with limited feedback. The preference-based action selection paradigm addresses this by maintaining numerical preference values for each available action and converting these preferences into selection probabilities using the softmax function.
Unlike simpler selection strategies that directly estimate action values, preference-based methods learn a relative preference for each action. These preferences are then transformed into a probability distribution using the softmax function:
$$\pi(a) = \frac{e^{H(a)}}{\sum_{b=1}^{k} e^{H(b)}}$$
Where:
• H(a) is the numerical preference for action a
• k is the number of available actions
• π(a) is the resulting probability of selecting action a
The softmax function ensures that actions with higher preferences are selected more frequently, while still maintaining exploration by never assigning zero probability to any action.
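As a sketch, the softmax transformation above might be computed as follows (subtracting the maximum preference before exponentiating is a standard numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(preferences):
    """Convert a list of preference values H(a) into selection probabilities."""
    # Subtract the max preference for numerical stability; the result is unchanged.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]
```

With all preferences at zero, `softmax([0.0, 0.0, 0.0])` returns a uniform distribution of 1/3 per action, and every action always receives strictly positive probability.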
After taking an action and receiving a reward, the preferences are updated using a gradient-based learning rule. This update mechanism increases the preference for actions that yield rewards better than the baseline (average reward) and decreases preferences for actions that underperform:
$$H_{t+1}(A_t) = H_t(A_t) + \alpha \cdot (R_t - \bar{R}_t) \cdot (1 - \pi_t(A_t))$$
$$H_{t+1}(a) = H_t(a) - \alpha \cdot (R_t - \bar{R}_t) \cdot \pi_t(a) \quad \text{for all } a \neq A_t$$
Where:
• H_t(a) is the preference for action a at time step t
• α is the step-size (learning rate) parameter
• R_t is the reward received at time step t
• R̄_t is the baseline (the running average of all rewards received so far)
• π_t(a) is the softmax probability of selecting action a at time step t
• A_t is the action selected at time step t
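A direct translation of the two update equations, as a hedged sketch (the function and argument names here are illustrative, not part of the problem statement):

```python
def update_preferences(H, probs, chosen, reward, baseline, alpha):
    """Apply the gradient update to the preference list H in place.

    H        : preference values H_t(a)
    probs    : softmax probabilities pi_t(a), computed before this update
    chosen   : index of the selected action A_t
    reward   : observed reward R_t
    baseline : running average reward (R-bar)_t
    alpha    : step-size parameter
    """
    advantage = reward - baseline
    for a in range(len(H)):
        if a == chosen:
            # Chosen action: move preference up when the advantage is positive.
            H[a] += alpha * advantage * (1 - probs[a])
        else:
            # All other actions: move their preferences the opposite way.
            H[a] -= alpha * advantage * probs[a]
```

Note that the update is zero for every action whenever the reward equals the baseline, since the advantage factor (R_t − R̄_t) multiplies both equations.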
Implement a PreferenceActionSelector class that:
• Initializes all action preferences to zero
• Computes softmax selection probabilities from the current preferences
• Updates the preferences after each (action, reward) observation using the gradient rule above
• Maintains a running average of all observed rewards as the baseline
The class should track a running baseline as the incremental mean of all rewards received, updating it with each new reward observation.
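The running baseline can be maintained without storing past rewards via the standard incremental-mean update, R̄_n = R̄_{n−1} + (R_n − R̄_{n−1})/n. A small sketch (the class and attribute names are illustrative):

```python
class RunningMean:
    """Incremental mean of a reward stream, updated one observation at a time."""

    def __init__(self):
        self.count = 0     # number of rewards seen so far
        self.mean = 0.0    # current running average

    def update(self, reward):
        # Equivalent to recomputing the mean of all rewards seen so far.
        self.count += 1
        self.mean += (reward - self.mean) / self.count
        return self.mean
```

After the first observation the mean equals that reward exactly, which is why the first update in the examples below produces a zero advantage.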
Example 1
Input: num_actions = 3, alpha = 0.1, actions_rewards = [[1, 1.0]]
Output: [0.33, 0.33, 0.34]
Explanation: Starting with 3 actions and all preferences at 0, the initial softmax gives equal probabilities of 0.333... for each action. After action 1 receives reward 1.0:
• The baseline is updated: R̄ = 1.0 (first reward)
• Since the reward equals the baseline, the advantage (R − R̄) = 0
• No preference update occurs, as the gradient term is zero when the advantage is zero
• The softmax probabilities remain approximately [0.33, 0.33, 0.34] (minor rounding differences)
Example 2
Input: num_actions = 4, alpha = 0.1, actions_rewards = [[0, 2.0], [1, -1.0], [2, 3.0], [0, 1.5]]
Output: [0.25, 0.21, 0.29, 0.25]
Explanation: Starting with 4 actions and equal preferences, each (action, reward) pair first updates the baseline and then shifts the preferences by the gradient rule. The accumulated updates result in action 2 having the highest probability (0.29), action 1 the lowest (0.21), and actions 0 and 3 near the uniform baseline (0.25).
Example 3
Input: num_actions = 2, alpha = 0.2, actions_rewards = [[0, 1.0], [1, 0.5], [0, 1.5], [0, 2.0]]
Output: [0.57, 0.43]
Explanation: With a higher learning rate (α = 0.2), preference updates are more aggressive. After these updates, action 0's preference has grown relative to action 1's, resulting in softmax probabilities of approximately [0.57, 0.43]. The higher learning rate causes faster convergence toward favoring the consistently higher-rewarding action.
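Putting the pieces together, here is a minimal sketch of the class (the method names `probabilities` and `update` are assumptions; the required interface may differ). Under the convention that the baseline is updated before computing the advantage, replaying the examples above reproduces their outputs:

```python
import math

class PreferenceActionSelector:
    """Gradient-based preference learning with softmax action selection."""

    def __init__(self, num_actions, alpha):
        self.H = [0.0] * num_actions   # action preferences, all start at 0
        self.alpha = alpha             # step-size parameter
        self.count = 0                 # number of rewards observed
        self.baseline = 0.0            # incremental mean of all rewards

    def probabilities(self):
        """Softmax over the current preferences (max-shifted for stability)."""
        m = max(self.H)
        exps = [math.exp(h - m) for h in self.H]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, action, reward):
        """Observe (action, reward) and apply the gradient preference update."""
        # Update the baseline first, so the very first reward yields zero advantage.
        self.count += 1
        self.baseline += (reward - self.baseline) / self.count
        advantage = reward - self.baseline
        probs = self.probabilities()
        for a in range(len(self.H)):
            if a == action:
                self.H[a] += self.alpha * advantage * (1 - probs[a])
            else:
                self.H[a] -= self.alpha * advantage * probs[a]
```

Feeding Example 3's sequence through this sketch ends with probabilities that round to [0.57, 0.43], and Example 2's sequence rounds to [0.25, 0.21, 0.29, 0.25], matching the expected outputs above.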
Constraints