Modern language models have revolutionized natural language processing through autoregressive text generation—a paradigm where text is produced one token at a time, with each new token conditioned on all previously generated tokens. This approach forms the foundation of breakthrough systems like conversational AI assistants, code generation tools, and creative writing applications.
In this challenge, you will implement a simplified transformer-based text synthesis engine that captures the essential components of models like GPT-2. Your implementation will process an input prompt and autoregressively generate new tokens by leveraging the following architectural components:
Each token in the vocabulary is mapped to a dense vector representation of fixed dimensionality. This embedding lookup transforms discrete tokens into a continuous vector space where semantic relationships can be captured.
$$\text{TokenEmb}(t) = \mathbf{W}_{te}[t]$$
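As a minimal sketch of this lookup (the vocabulary size, embedding dimension, and random weights below are illustrative, not the challenge's actual parameters), the embedding is just row indexing into the matrix $\mathbf{W}_{te}$:

```python
import numpy as np

# Hypothetical tiny setup: vocabulary of 4 tokens, embedding dimension 8.
vocab_size, d_model = 4, 8
rng = np.random.default_rng(0)
W_te = rng.normal(size=(vocab_size, d_model))  # token embedding matrix

def token_emb(token_ids):
    # TokenEmb(t) = W_te[t]: a simple row-indexing lookup.
    return W_te[token_ids]

vecs = token_emb(np.array([0, 1, 1]))  # shape (3, 8)
```

Note that identical token IDs map to identical vectors; position is added separately.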
Since the transformer architecture processes all positions in parallel, positional information must be explicitly injected. Learned positional embeddings encode the absolute position of each token in the sequence:
$$\text{PosEmb}(i) = \mathbf{W}_{pe}[i]$$
The input representation combines both: $\mathbf{h}_0 = \text{TokenEmb}(t) + \text{PosEmb}(i)$
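The combined input representation can be sketched as follows (again with illustrative shapes and random weights; the real model loads learned matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, d_model = 4, 16, 8
W_te = rng.normal(size=(vocab_size, d_model))  # token embeddings
W_pe = rng.normal(size=(max_pos, d_model))     # learned positional embeddings

def embed(token_ids):
    # h0[i] = TokenEmb(t_i) + PosEmb(i)
    positions = np.arange(len(token_ids))
    return W_te[token_ids] + W_pe[positions]
```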
The attention mechanism allows each token to dynamically attend to relevant positions in the sequence. For autoregressive generation, causal masking ensures that tokens can only attend to previous positions (including themselves), preventing information leakage from future tokens.
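A minimal single-head causal attention sketch, assuming the queries, keys, and values have already been projected; future positions are masked with a large negative value before the softmax so they receive (effectively) zero weight:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(q, k, v):
    # q, k, v: (seq_len, d_k). Entries above the diagonal of the score
    # matrix correspond to future positions and are masked out.
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(mask, -1e10, scores)
    return softmax(scores) @ v
```

A useful sanity check: position 0 can attend only to itself, so its output equals `v[0]` exactly.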
After attention, each position is processed independently through a two-layer feed-forward network with a non-linear activation (typically GELU):
$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$
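The formula above translates directly to code. The sketch below uses the tanh approximation of GELU (the variant GPT-2 uses); the weight names are illustrative:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = GELU(x W1 + b1) W2 + b2, applied independently per position.
    return gelu(x @ W1 + b1) @ W2 + b2
```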
Layer normalization is applied before each sub-layer (pre-norm architecture) to stabilize training and generation:
$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta$$
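Here $\mu$ and $\sigma$ are the mean and standard deviation taken over the feature dimension of each position. A minimal sketch:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's features to zero mean / unit variance,
    # then rescale and shift with the learned gamma and beta.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

The `eps` term guards against division by zero for near-constant inputs.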
The final hidden states are projected back to vocabulary logits using a linear transformation. The next token is selected greedily by taking the argmax of these logits (since softmax is monotonic, applying it first does not change which token is selected).
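This greedy selection step can be sketched as follows; `W_head` is a hypothetical name for the output projection matrix:

```python
import numpy as np

def next_token(h_last, W_head):
    # Project the last position's hidden state to vocabulary logits and
    # pick the highest-scoring token (greedy decoding). Softmax is
    # monotonic, so argmax over logits equals argmax over probabilities.
    logits = h_last @ W_head
    return int(np.argmax(logits))
```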
Your function should implement the following autoregressive loop:
Use the provided load_encoder_hparams_and_params() helper function to obtain:
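The loop itself might look like the sketch below. This is an assumption-laden outline, not the challenge's actual API: `model_forward` stands in for a full transformer forward pass returning logits of shape `(seq_len, vocab_size)`, and its name and signature are hypothetical.

```python
import numpy as np

def generate(token_ids, n_tokens_to_generate, model_forward):
    # Hedged sketch of greedy autoregressive generation: re-run the model
    # on the full prefix, append the argmax token, and repeat.
    for _ in range(n_tokens_to_generate):
        logits = model_forward(token_ids)      # (seq_len, vocab_size)
        next_id = int(np.argmax(logits[-1]))   # greedy pick at the last position
        token_ids = token_ids + [next_id]
    return token_ids
```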
prompt = "hello"
n_tokens_to_generate = 5

Expected output: "hello world world world &lt;UNK&gt; &lt;UNK&gt;"

The input prompt "hello" is tokenized into a single token. The simplified transformer model processes this input and generates 5 new tokens autoregressively:
Initial encoding: "hello" → token ID 0
Generation loop:
Final decoding: [0, 1, 1, 1, 3, 3] → "hello world world world <UNK> <UNK>"
The generation pattern reflects the simplified model's random weights and limited vocabulary.
prompt = "hello world"
n_tokens_to_generate = 3

Expected output: "hello world world world &lt;UNK&gt;"

The prompt contains two tokens that are tokenized separately:
Tokenization: "hello world" → [0, 1] (two tokens)
Generation loop:
Final output: The original prompt plus 3 generated tokens yields "hello world world world <UNK>"
The model shows repetition of the dominant token before transitioning to <UNK> as context length grows.
prompt = "the"
n_tokens_to_generate = 1

Expected output: "the world"

A minimal generation example:
This demonstrates the base case of the autoregressive loop.
Constraints