The self-attention mechanism is a revolutionary concept in deep learning that allows each element in a sequence to dynamically attend to every other element, learning contextual relationships regardless of their positional distance. This mechanism forms the backbone of Transformer architectures that power modern AI systems like GPT, BERT, and Vision Transformers.
Imagine you have a sequence of N elements, each represented by a numeric value. The self-attention mechanism asks a fundamental question: "How much should each element in the sequence pay attention to every other element?"
For each position in the sequence, self-attention:

1. Computes a similarity score between that position and every position in the sequence.
2. Normalizes those scores into attention weights with a softmax.
3. Produces an output that is the attention-weighted sum of all values.
Given a sequence of values ( V = [v_1, v_2, ..., v_N] ) and a scaling dimension ( d ), the self-attention output for each position ( i ) is computed as follows:
For each pair of positions ( (i, j) ), compute the raw attention score:
$$\text{score}_{ij} = \frac{v_i \cdot v_j}{\sqrt{d}}$$
The division by ( \sqrt{d} ) is called scaled dot-product attention — it prevents the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients.
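To see the saturation effect concretely, compare the softmax of some large raw scores with and without the scaling. This is a minimal sketch; the scores and the dimension `d = 100` are illustrative values of my own, not taken from the examples below:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw dot products whose magnitude grows with the dimension d.
raw = [10.0, 20.0, 30.0]
d = 100

print(softmax(raw))                               # nearly one-hot: tiny gradients
print(softmax([x / math.sqrt(d) for x in raw]))   # smoother after dividing by sqrt(d)
```

Without scaling, almost all probability mass lands on the largest score; after dividing by √d the distribution stays spread out, which keeps the softmax in a region with usable gradients.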
Convert the raw scores into attention weights using the softmax function:
$$\alpha_{ij} = \frac{e^{\text{score}_{ij}}}{\sum_{k=1}^{N} e^{\text{score}_{ik}}}$$
The attention weights ( \alpha_{ij} ) form a probability distribution — they sum to 1 and indicate how much position ( i ) should attend to position ( j ).
The final output for position ( i ) is the weighted sum of all values:
$$\text{output}_i = \sum_{j=1}^{N} \alpha_{ij} \cdot v_j$$
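Putting the three steps together, a straightforward reference implementation for scalar values might look like the sketch below (the function name `self_attention` is my own):

```python
import math

def self_attention(values, d):
    """Self-attention over scalar values: each output is a
    softmax-weighted sum of the whole sequence."""
    n = len(values)
    outputs = []
    for i in range(n):
        # Step 1: scaled dot-product scores between position i and every j.
        scores = [values[i] * values[j] / math.sqrt(d) for j in range(n)]
        # Step 2: softmax (subtracting the max keeps exp() from overflowing).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Step 3: weighted sum of all values.
        outputs.append(sum(w * v for w, v in zip(weights, values)))
    return outputs

print([round(o, 4) for o in self_attention([1, 2, 3], 1)])
# → [2.5752, 2.8509, 2.948]
```

Runs in O(N²) time, as every position attends to every other position.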
Implement the self-attention mechanism that:

1. Computes the scaled pairwise attention scores for every pair of positions.
2. Applies softmax to each position's scores to obtain attention weights.
3. Returns the weighted sum for each position, rounded to four decimal places.
["8.9993", "8.9638", "9.0"])sequence_length = 5
values = [4, 2, 7, 1, 9]
dimension = 1["8.9993", "8.9638", "9.0", "8.7259", "9.0"]Step-by-step calculation for position 0 (value = 4):
Compute attention scores with all positions (dot products of 4 with each value, divided by √1 = 1): 16, 8, 28, 4, 36.
Apply softmax: The exponentials of high scores (28, 36) dominate, so attention weights heavily favor positions 2 (value=7) and 4 (value=9).
Weighted sum: Since most attention weight goes to position 4 (value=9), the output for position 0 is approximately 8.9993.
Pattern observed: Positions with larger values attract more attention due to higher dot products, causing outputs to gravitate toward the maximum value in the sequence (9 in this case).
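The walkthrough above can be checked numerically. A minimal sketch, assuming (as the expected output suggests) that each result is rounded to four decimal places; the helper name `attention_output` is my own:

```python
import math

def attention_output(values, d, i):
    """Self-attention output for a single query position i."""
    scores = [values[i] * v / math.sqrt(d) for v in values]
    m = max(scores)                       # max-subtraction for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

values = [4, 2, 7, 1, 9]
outputs = [str(round(attention_output(values, 1, i), 4)) for i in range(len(values))]
print(outputs)  # → ['8.9993', '8.9638', '9.0', '8.7259', '9.0']
```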
Example 2:

sequence_length = 3
values = [1, 2, 3]
dimension = 1

Output: ["2.5752", "2.8509", "2.948"]

With values [1, 2, 3] and dimension 1:
For position 0 (value = 1): scores 1, 2, 3 → weights ≈ 0.090, 0.245, 0.665 → output ≈ 2.5752.
For position 1 (value = 2): scores 2, 4, 6 → weights ≈ 0.016, 0.117, 0.867 → output ≈ 2.8509.
For position 2 (value = 3): scores 3, 6, 9 → weights ≈ 0.002, 0.047, 0.950 → output ≈ 2.948.
Notice how larger values produce outputs closer to the maximum, demonstrating the "rich get richer" property of softmax attention.
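This "rich get richer" effect can be made concrete by watching how much weight each query places on the largest value as the query grows (a small sketch with d = 1, as in this example):

```python
import math

values = [1, 2, 3]
for v in values:
    scores = [v * x for x in values]          # d = 1, so no scaling needed
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    w_max = exps[-1] / sum(exps)              # weight on the largest value (3)
    print(f"query value {v}: weight on max = {w_max:.3f}")
# → 0.665, 0.867, 0.950: larger queries concentrate ever more on the maximum
```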
Example 3:

sequence_length = 4
values = [2, 4, 6, 8]
dimension = 4

Output: ["7.6896", "7.9627", "7.995", "7.9993"]

With a larger dimension parameter (d = 4), attention scores are scaled down by √4 = 2:
Scaling effect demonstration: for position 0 (value = 2), the raw dot products 4, 8, 12, 16 are halved to 2, 4, 6, 8, giving softmax weights ≈ 0.002, 0.016, 0.117, 0.865 — noticeably less peaked than they would be without the division.
The scaling by √d moderates the attention distribution, preventing it from becoming too peaked. Even with scaling, larger values still dominate, so every output lies close to the maximum value in the sequence.
The progression from 7.69 to 7.99 shows how the attention mechanism smoothly interpolates, with higher-valued positions attending more strongly to the maximum value (8).
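The moderating effect of √d can be observed directly by comparing the attention weights for position 0 under d = 1 and d = 4 (a sketch using this example's values; the helper name `attention_weights` is my own):

```python
import math

def attention_weights(values, d, i):
    """Softmax attention weights for query position i."""
    scores = [values[i] * v / math.sqrt(d) for v in values]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

values = [2, 4, 6, 8]
w1 = attention_weights(values, 1, 0)   # unscaled scores: 4, 8, 12, 16
w4 = attention_weights(values, 4, 0)   # scaled scores:   2, 4, 6, 8
print(f"d=1 peak weight: {max(w1):.3f}")   # sharper distribution
print(f"d=4 peak weight: {max(w4):.3f}")   # softer distribution
```

The peak weight drops when d grows, which is exactly the moderation the scaling term is designed to provide.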
Constraints