The self-attention mechanism is a revolutionary concept in deep learning that allows each element in a sequence to dynamically attend to every other element, learning contextual relationships regardless of their positional distance. This mechanism forms the backbone of Transformer architectures that power modern AI systems like GPT, BERT, and Vision Transformers.
Imagine you have a sequence of N elements, each represented by a numeric value. The self-attention mechanism asks a fundamental question: "How much should each element in the sequence pay attention to every other element?"
For each position in the sequence, self-attention:

1. Computes a similarity score between that position and every position in the sequence.
2. Normalizes those scores into attention weights with a softmax.
3. Produces an output that is the attention-weighted sum of all values.
Given a sequence of values ( V = [v_1, v_2, ..., v_N] ) and a scaling dimension ( d ), the self-attention output for each position ( i ) is computed as follows:
For each pair of positions ( (i, j) ), compute the raw attention score:
$$\text{score}_{ij} = \frac{v_i \cdot v_j}{\sqrt{d}}$$
The division by ( \sqrt{d} ) is called scaled dot-product attention — it prevents the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients.
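To see the saturation effect concretely, compare the softmax of some large raw scores with and without the scaling. This is a minimal sketch; the scores and the dimension `d = 100` are illustrative values of my own, not taken from the examples below:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw dot products whose magnitude grows with the dimension d.
raw = [10.0, 20.0, 30.0]
d = 100

print(softmax(raw))                               # nearly one-hot: tiny gradients
print(softmax([x / math.sqrt(d) for x in raw]))   # smoother after dividing by sqrt(d)
```

Without scaling, almost all probability mass lands on the largest score; after dividing by √d the distribution stays spread out, which keeps the softmax in a region with usable gradients.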
Convert the raw scores into attention weights using the softmax function:
$$\alpha_{ij} = \frac{e^{\text{score}_{ij}}}{\sum_{k=1}^{N} e^{\text{score}_{ik}}}$$
The attention weights ( \alpha_{ij} ) form a probability distribution — they sum to 1 and indicate how much position ( i ) should attend to position ( j ).
The final output for position ( i ) is the weighted sum of all values:
$$\text{output}_i = \sum_{j=1}^{N} \alpha_{ij} \cdot v_j$$
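Putting the three steps together, a straightforward reference implementation for scalar values might look like the sketch below (the function name `self_attention` is my own):

```python
import math

def self_attention(values, d):
    """Self-attention over scalar values: each output is a
    softmax-weighted sum of the whole sequence."""
    n = len(values)
    outputs = []
    for i in range(n):
        # Step 1: scaled dot-product scores between position i and every j.
        scores = [values[i] * values[j] / math.sqrt(d) for j in range(n)]
        # Step 2: softmax (subtracting the max keeps exp() from overflowing).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Step 3: weighted sum of all values.
        outputs.append(sum(w * v for w, v in zip(weights, values)))
    return outputs

print([round(o, 4) for o in self_attention([1, 2, 3], 1)])
# → [2.5752, 2.8509, 2.948]
```

Runs in O(N²) time, as every position attends to every other position.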
Implement the self-attention mechanism that:

1. Computes the scaled pairwise attention scores for every pair of positions.
2. Applies softmax to each position's scores to obtain attention weights.
3. Returns the weighted sum for each position, rounded to four decimal places.
["8.9993", "8.9638", "9.0"])sequence_length = 5
values = [4, 2, 7, 1, 9]
dimension = 1["8.9993", "8.9638", "9.0", "8.7259", "9.0"]Step-by-step calculation for position 0 (value = 4):
Compute attention scores with all positions (dot products of 4 with each value, divided by √1 = 1): 16, 8, 28, 4, 36.
Apply softmax: The exponentials of high scores (28, 36) dominate, so attention weights heavily favor positions 2 (value=7) and 4 (value=9).
Weighted sum: Since most attention weight goes to position 4 (value=9), the output for position 0 is approximately 8.9993.
Pattern observed: Positions with larger values attract more attention due to higher dot products, causing outputs to gravitate toward the maximum value in the sequence (9 in this case).
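The walkthrough above can be checked numerically. A minimal sketch, assuming (as the expected output suggests) that each result is rounded to four decimal places; the helper name `attention_output` is my own:

```python
import math

def attention_output(values, d, i):
    """Self-attention output for a single query position i."""
    scores = [values[i] * v / math.sqrt(d) for v in values]
    m = max(scores)                       # max-subtraction for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

values = [4, 2, 7, 1, 9]
outputs = [str(round(attention_output(values, 1, i), 4)) for i in range(len(values))]
print(outputs)  # → ['8.9993', '8.9638', '9.0', '8.7259', '9.0']
```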
Example 2:

sequence_length = 3
values = [1, 2, 3]
dimension = 1

Output: ["2.5752", "2.8509", "2.948"]

With values [1, 2, 3] and dimension 1:
For position 0 (value = 1): scores 1, 2, 3 → weights ≈ 0.090, 0.245, 0.665 → output ≈ 2.5752.
For position 1 (value = 2): scores 2, 4, 6 → weights ≈ 0.016, 0.117, 0.867 → output ≈ 2.8509.
For position 2 (value = 3): scores 3, 6, 9 → weights ≈ 0.002, 0.047, 0.950 → output ≈ 2.948.
Notice how larger values produce outputs closer to the maximum, demonstrating the "rich get richer" property of softmax attention.
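This "rich get richer" effect can be made concrete by watching how much weight each query places on the largest value as the query grows (a small sketch with d = 1, as in this example):

```python
import math

values = [1, 2, 3]
for v in values:
    scores = [v * x for x in values]          # d = 1, so no scaling needed
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    w_max = exps[-1] / sum(exps)              # weight on the largest value (3)
    print(f"query value {v}: weight on max = {w_max:.3f}")
# → 0.665, 0.867, 0.950: larger queries concentrate ever more on the maximum
```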
Example 3:

sequence_length = 4
values = [2, 4, 6, 8]
dimension = 4

Output: ["7.6896", "7.9627", "7.995", "7.9993"]

With a larger dimension parameter (d = 4), attention scores are scaled down by √4 = 2:
Scaling effect demonstration: for position 0 (value = 2), the raw dot products 4, 8, 12, 16 are halved to 2, 4, 6, 8, giving softmax weights ≈ 0.002, 0.016, 0.117, 0.865 — noticeably less peaked than they would be without the division.
The scaling by √d moderates the attention distribution, preventing it from becoming too peaked. Even with scaling, larger values still dominate, so every output lies close to the maximum value in the sequence.
The progression from 7.69 to 7.99 shows how the attention mechanism smoothly interpolates, with higher-valued positions attending more strongly to the maximum value (8).
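The moderating effect of √d can be observed directly by comparing the attention weights for position 0 under d = 1 and d = 4 (a sketch using this example's values; the helper name `attention_weights` is my own):

```python
import math

def attention_weights(values, d, i):
    """Softmax attention weights for query position i."""
    scores = [values[i] * v / math.sqrt(d) for v in values]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

values = [2, 4, 6, 8]
w1 = attention_weights(values, 1, 0)   # unscaled scores: 4, 8, 12, 16
w4 = attention_weights(values, 4, 0)   # scaled scores:   2, 4, 6, 8
print(f"d=1 peak weight: {max(w1):.3f}")   # sharper distribution
print(f"d=4 peak weight: {max(w4):.3f}")   # softer distribution
```

The peak weight drops when d grows, which is exactly the moderation the scaling term is designed to provide.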
Constraints