In modern sequence-to-sequence models like Transformers, understanding the order of tokens in a sequence is crucial for tasks like language understanding, translation, and generation. Unlike recurrent networks that inherently process sequences step-by-step, Transformers process all tokens simultaneously through self-attention mechanisms. This creates a fundamental challenge: how does the model know which position each token occupies in the sequence?
The solution is sinusoidal position embeddings—a deterministic encoding scheme that injects positional information into the model without requiring any learned parameters. This technique was introduced in the seminal "Attention Is All You Need" paper and remains foundational to understanding modern AI architectures.
Position embeddings assign a unique, continuous vector to each position in a sequence. The sinusoidal approach uses sine and cosine functions at varying frequencies to create these vectors. The mathematical elegance lies in the fact that for any fixed offset k, the encoding PE(pos + k) can be expressed as a linear function of PE(pos), which makes it easy for the model to attend to relative positions.
For a position pos in the sequence and dimension index i within the embedding vector of size d_model, the encoding is computed as:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Where:
- pos is the position index in the sequence (0-based),
- i indexes pairs of dimensions, so 2i and 2i+1 are the even and odd dimension indices,
- d_model is the embedding dimensionality.
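The two formulas above can be implemented directly. Below is a minimal NumPy sketch (the function name and use of NumPy are assumptions; the problem does not prescribe a specific library):

```python
import numpy as np

def sinusoidal_position_embeddings(sequence_length, embedding_dim):
    """Return sinusoidal position embeddings of shape (1, sequence_length, embedding_dim)."""
    if sequence_length == 0:
        return -1  # no positions to encode (see the edge-case example below)
    positions = np.arange(sequence_length)[:, np.newaxis]         # shape (seq, 1)
    two_i = np.arange(0, embedding_dim, 2)[np.newaxis, :]         # even indices 2i, shape (1, d/2)
    angles = positions / np.power(10000.0, two_i / embedding_dim)  # pos / 10000^(2i/d_model)
    pe = np.zeros((sequence_length, embedding_dim))
    pe[:, 0::2] = np.sin(angles)  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on odd dimensions
    return pe[np.newaxis, :, :]   # prepend the batch dimension
```

Note that each angle is shared by a sine/cosine pair, so only d_model/2 distinct frequencies are computed.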
Implement a function that generates sinusoidal position embeddings for a given sequence length and embedding dimensionality. Your function should:
- apply sine to even dimension indices and cosine to odd dimension indices, using the formulas above,
- return the embeddings with a leading batch dimension, i.e. shape (1, sequence_length, embedding_dim),
- return -1 when sequence_length is 0, since there are no positions to encode.
This implementation forms the backbone of positional understanding in transformer-based models and is essential knowledge for anyone working with modern NLP and AI systems.
Example 1

Input: sequence_length = 2, embedding_dim = 8
Output: [[[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0], [0.84130859375, 0.54052734375, 0.099853515625, 0.9951171875, 0.01000213623046875, 1.0, 0.0010004043579101562, 1.0]]]

For an 8-dimensional embedding with 2 positions:
Position 0 (all zeros for sine, all ones for cosine): [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
Position 1 (varying frequencies): [sin(1), cos(1), sin(0.1), cos(0.1), sin(0.01), cos(0.01), sin(0.001), cos(0.001)]
The output shape is (1, 2, 8), representing a batch of 1 with 2 positions and 8 embedding dimensions.
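Position 1 of this example can be reproduced by hand with standard-library math alone (float64 digits differ slightly from the reduced-precision values shown in the output above):

```python
import math

# Recompute the Position 1 row of the 8-dimensional example.
d_model = 8
pos = 1
row = []
for two_i in range(0, d_model, 2):
    angle = pos / (10000 ** (two_i / d_model))  # 1, 0.1, 0.01, 0.001
    row.extend([math.sin(angle), math.cos(angle)])
print([round(v, 4) for v in row])
# -> [0.8415, 0.5403, 0.0998, 0.995, 0.01, 1.0, 0.001, 1.0]
```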
Example 2

Input: sequence_length = 3, embedding_dim = 4
Output: [[[0.0, 1.0, 0.0, 1.0], [0.84130859375, 0.54052734375, 0.01000213623046875, 1.0], [0.9091796875, -0.416259765625, 0.0200042724609375, 1.0]]]

For a 4-dimensional embedding with 3 positions:
Position 0: [sin(0), cos(0), sin(0), cos(0)] = [0.0, 1.0, 0.0, 1.0]
Position 1: [sin(1), cos(1), sin(0.01), cos(0.01)] ≈ [0.8415, 0.5403, 0.0100, 1.0]
Position 2: [sin(2), cos(2), sin(0.02), cos(0.02)] ≈ [0.9093, -0.4161, 0.0200, 1.0]
Notice how higher frequency components (indices 0,1) change rapidly between positions while lower frequency components (indices 2,3) change slowly. This multi-scale encoding helps the model distinguish both nearby and distant positions.
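This multi-scale behavior can be made concrete by computing each sine channel's wavelength: the angle pos / 10000^(2i/d_model) completes one full cycle every 2·pi·10000^(2i/d_model) positions. A short sketch (using d_model = 8 to match the first example):

```python
import numpy as np

# Wavelength of each sine channel for d_model = 8: low indices oscillate
# quickly across positions, high indices vary slowly.
d_model = 8
for two_i in range(0, d_model, 2):
    wavelength = 2 * np.pi * 10000 ** (two_i / d_model)
    print(f"channel {two_i}: wavelength ≈ {wavelength:.1f} positions")
# channel 0 cycles every ~6.3 positions, while channel 6 takes ~6283 positions,
# so together the channels distinguish both nearby and distant positions.
```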
Example 3

Input: sequence_length = 0, embedding_dim = 8
Output: -1

When sequence_length is 0, there are no positions to encode. The function returns -1 to indicate invalid input, as generating embeddings for zero positions is undefined.
Constraints