Modern language models have revolutionized natural language processing through autoregressive text generation—a paradigm where text is produced one token at a time, with each new token conditioned on all previously generated tokens. This approach forms the foundation of breakthrough systems like conversational AI assistants, code generation tools, and creative writing applications.
In this challenge, you will implement a simplified transformer-based text synthesis engine that captures the essential components of models like GPT-2. Your implementation will process an input prompt and autoregressively generate new tokens by leveraging the following architectural components:
Each token in the vocabulary is mapped to a dense vector representation of fixed dimensionality. This embedding lookup transforms discrete tokens into a continuous vector space where semantic relationships can be captured.
$$\text{TokenEmb}(t) = \mathbf{W}_{te}[t]$$
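As a minimal sketch of this lookup (the vocabulary size, embedding dimension, and random weights below are illustrative, not the challenge's actual parameters), the embedding is just row indexing into the matrix $\mathbf{W}_{te}$:

```python
import numpy as np

# Hypothetical tiny setup: vocabulary of 4 tokens, embedding dimension 8.
vocab_size, d_model = 4, 8
rng = np.random.default_rng(0)
W_te = rng.normal(size=(vocab_size, d_model))  # token embedding matrix

def token_emb(token_ids):
    # TokenEmb(t) = W_te[t]: a simple row-indexing lookup.
    return W_te[token_ids]

vecs = token_emb(np.array([0, 1, 1]))  # shape (3, 8)
```

Note that identical token IDs map to identical vectors; position is added separately.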
Since the transformer architecture processes all positions in parallel, positional information must be explicitly injected. Learned positional embeddings encode the absolute position of each token in the sequence:
$$\text{PosEmb}(i) = \mathbf{W}_{pe}[i]$$
The input representation combines both: $\mathbf{h}_0 = \text{TokenEmb}(t) + \text{PosEmb}(i)$
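The combined input representation can be sketched as follows (again with illustrative shapes and random weights; the real model loads learned matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, d_model = 4, 16, 8
W_te = rng.normal(size=(vocab_size, d_model))  # token embeddings
W_pe = rng.normal(size=(max_pos, d_model))     # learned positional embeddings

def embed(token_ids):
    # h0[i] = TokenEmb(t_i) + PosEmb(i)
    positions = np.arange(len(token_ids))
    return W_te[token_ids] + W_pe[positions]
```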
The attention mechanism allows each token to dynamically attend to relevant positions in the sequence. For autoregressive generation, causal masking ensures that tokens can only attend to previous positions (including themselves), preventing information leakage from future tokens.
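A minimal single-head causal attention sketch, assuming the queries, keys, and values have already been projected; future positions are masked with a large negative value before the softmax so they receive (effectively) zero weight:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(q, k, v):
    # q, k, v: (seq_len, d_k). Entries above the diagonal of the score
    # matrix correspond to future positions and are masked out.
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(mask, -1e10, scores)
    return softmax(scores) @ v
```

A useful sanity check: position 0 can attend only to itself, so its output equals `v[0]` exactly.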
After attention, each position is processed independently through a two-layer feed-forward network with a non-linear activation (typically GELU):
$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$
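The formula above translates directly to code. The sketch below uses the tanh approximation of GELU (the variant GPT-2 uses); the weight names are illustrative:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = GELU(x W1 + b1) W2 + b2, applied independently per position.
    return gelu(x @ W1 + b1) @ W2 + b2
```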
Layer normalization is applied before each sub-layer (pre-norm architecture) to stabilize training and generation:
$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta$$
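Here $\mu$ and $\sigma$ are the mean and standard deviation taken over the feature dimension of each position. A minimal sketch:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's features to zero mean / unit variance,
    # then rescale and shift with the learned gamma and beta.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

The `eps` term guards against division by zero for near-constant inputs.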
The final hidden states are projected back to vocabulary logits using a linear transformation. The next token is selected greedily by taking the argmax of these logits (since softmax is monotonic, applying it first does not change which token is selected).
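This greedy selection step can be sketched as follows; `W_head` is a hypothetical name for the output projection matrix:

```python
import numpy as np

def next_token(h_last, W_head):
    # Project the last position's hidden state to vocabulary logits and
    # pick the highest-scoring token (greedy decoding). Softmax is
    # monotonic, so argmax over logits equals argmax over probabilities.
    logits = h_last @ W_head
    return int(np.argmax(logits))
```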
Your function should implement the following autoregressive loop:
Use the provided load_encoder_hparams_and_params() helper function to obtain:
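The loop itself might look like the sketch below. This is an assumption-laden outline, not the challenge's actual API: `model_forward` stands in for a full transformer forward pass returning logits of shape `(seq_len, vocab_size)`, and its name and signature are hypothetical.

```python
import numpy as np

def generate(token_ids, n_tokens_to_generate, model_forward):
    # Hedged sketch of greedy autoregressive generation: re-run the model
    # on the full prefix, append the argmax token, and repeat.
    for _ in range(n_tokens_to_generate):
        logits = model_forward(token_ids)      # (seq_len, vocab_size)
        next_id = int(np.argmax(logits[-1]))   # greedy pick at the last position
        token_ids = token_ids + [next_id]
    return token_ids
```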
prompt = "hello"
n_tokens_to_generate = 5

Expected output: "hello world world world &lt;UNK&gt; &lt;UNK&gt;"

The input prompt "hello" is tokenized into a single token. The simplified transformer model processes this input and generates 5 new tokens autoregressively:
Initial encoding: "hello" → token ID 0
Generation loop:
Final decoding: [0, 1, 1, 1, 3, 3] → "hello world world world <UNK> <UNK>"
The generation pattern reflects the simplified model's random weights and limited vocabulary.
prompt = "hello world"
n_tokens_to_generate = 3

Expected output: "hello world world world &lt;UNK&gt;"

The prompt contains two tokens that are tokenized separately:
Tokenization: "hello world" → [0, 1] (two tokens)
Generation loop:
Final output: The original prompt plus 3 generated tokens yields "hello world world world <UNK>"
The model shows repetition of the dominant token before transitioning to <UNK> as context length grows.
prompt = "the"
n_tokens_to_generate = 1

Expected output: "the world"

A minimal generation example:
This demonstrates the base case of the autoregressive loop.
Constraints