A base language model trained purely on next-token prediction is a powerful engine, but it behaves like an autocomplete system: give it "The capital of France is" and it will output "Paris", but ask it "What is the capital of France?" and it may continue with "A question for today's geography lesson..." rather than answering.
This is the instruction-following gap. Base models learn to predict text, but not to follow instructions. They lack the alignment between user intent and model behavior that makes an LLM useful as an assistant, coding tool, or reasoning engine.
Instruction tuning bridges this gap. Through supervised fine-tuning on instruction-response pairs, we teach models to recognize instructions and respond appropriately. This seemingly simple step, showing the model examples of helpful responses, produces the dramatic behavioral shift that turns a base language model into an assistant like ChatGPT.
This page covers the complete instruction tuning pipeline: the motivation and mechanics, instruction dataset creation, supervised fine-tuning techniques, and the challenges of creating genuinely helpful instruction-following models. You will understand how to transform any capable base model into an effective assistant.
Base language models are trained to model the distribution of text on the internet. This creates a fundamental misalignment with how we want to use them:
Given the prompt "Write a poem about autumn", a base model predicts what typically follows such text in its training data. This might be:
The base model doesn't "understand" that you want a poem—it predicts tokens. Sometimes prediction and instruction-following coincide; often they don't.
Given the dramatic behavioral change, one might expect instruction tuning to require retraining from scratch. Remarkably, the opposite is true:
| Training Phase | Compute (GPT-3 scale) | Tokens | Effect |
|---|---|---|---|
| Pre-training | ~$5M (3×10²³ FLOPs) | 300B | Acquire knowledge and capabilities |
| Instruction Tuning | ~$100K (<1% of pre-training compute) | 10-100M | Align behavior to instructions |
Key insight: Instruction tuning doesn't teach new knowledge—it teaches the model to access and apply knowledge it already has. The capability was latent in the base model; instruction tuning unlocks it.
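To put the token counts from the table in proportion, here is a small back-of-the-envelope calculation. It uses only the rough figures quoted above (300B pre-training tokens, 10-100M instruction-tuning tokens); the variable names are ours.

```python
# Rough figures from the table above (GPT-3-scale estimates, not measurements).
PRETRAIN_TOKENS = 300e9
SFT_TOKENS_RANGE = (10e6, 100e6)

# Share of pre-training data volume that instruction tuning represents.
token_fractions = [t / PRETRAIN_TOKENS for t in SFT_TOKENS_RANGE]

print(
    "Instruction-tuning tokens as a share of pre-training tokens: "
    f"{token_fractions[0]:.4%} to {token_fractions[1]:.4%}"
)
```

Even at the upper end, the instruction data is a few hundredths of a percent of the pre-training corpus, which is consistent with the view that the step reshapes behavior rather than adding knowledge.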
This efficiency has profound implications:
Current evidence suggests that instruction tuning primarily teaches the model a format and style—the 'surface' of being helpful—rather than fundamentally changing its underlying capabilities. The base model's knowledge and reasoning ability largely determine the ceiling; instruction tuning teaches the model to consistently reach that ceiling.
The quality and diversity of instruction data largely determine the resulting model's capabilities. Creating effective instruction datasets is both art and science.
Instruction examples typically follow a structured format:
{
  "instruction": "Write a Python function that calculates factorial",
  "input": "(optional) Additional context or parameters",
  "output": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
}
The model learns to map (instruction, input) pairs to appropriate outputs. During inference, users provide the instruction/input, and the model generates the output.
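As a minimal sketch of how such a record might be turned into a training prompt and target, the function below joins the fields with section markers. The `### Instruction:` / `### Input:` / `### Response:` markers follow the Alpaca-style convention and are an assumption for illustration; any consistent template would serve the same purpose.

```python
def format_example(example: dict) -> tuple[str, str]:
    """Return (prompt, target) for one instruction record.

    The section markers are an illustrative, Alpaca-style assumption,
    not a format required by the text above.
    """
    instruction = example["instruction"].strip()
    context = (example.get("input") or "").strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:  # the "input" field is optional
        parts.append(f"### Input:\n{context}")
    parts.append("### Response:\n")

    prompt = "\n\n".join(parts)
    return prompt, example["output"]


record = {
    "instruction": "Write a Python function that calculates factorial",
    "input": "",
    "output": (
        "def factorial(n):\n"
        "    if n <= 1:\n"
        "        return 1\n"
        "    return n * factorial(n - 1)"
    ),
}
prompt, target = format_example(record)
print(prompt + target)
```

At inference time only the prompt part is supplied, and the model is expected to generate what the target would have been.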
| Category | Examples | Purpose |
|---|---|---|
| Open-ended Generation | Write a story, compose a poem, brainstorm ideas | Creative capability, fluent generation |
| Closed QA | What is the capital of France? | Factual recall, concise answering |
| Open QA | Explain why the sky is blue | Explanation, teaching, reasoning shown |
| Task Completion | Summarize this text, translate this, extract entities | Following specific procedures |
| Conversation | Multi-turn dialogues with context | Coherent multi-turn interaction |
| Math/Reasoning | Solve 2x + 5 = 15, logic puzzles | Step-by-step reasoning |
| Coding | Write/debug/explain code | Programming capability |
| Refusal | How to harm someone → I can't help with that | Safety alignment |
1. Human-Written Data
The highest quality but most expensive approach. Human annotators create both instructions and responses.
Pros: High quality, nuanced, authentic human intent
Cons: Expensive ($10-50 per high-quality example), slow to scale
2. Template-Based Generation
Define templates and fill with variations:
Templates:
- "Explain {concept} as if I were {age_group}"
- "Compare and contrast {A} and {B}"
- "Write a {length} {genre} story about {topic}"
Pros: Scalable, controllable diversity
Cons: May lack naturalness, template artifacts
3. Model-Generated Data (Self-Instruct)
Use an existing LLM to generate instruction-response pairs:
Pros: Highly scalable, leverages model capability
Cons: May amplify model biases, quality ceiling limited by generator
4. Distillation from Stronger Models
Collect outputs from GPT-4 or Claude to train smaller models:
Pros: High-quality outputs, scalable
Cons: Legal concerns (ToS violations), capability limited by source
import random

from openai import OpenAI  # requires openai>=1.0


class SelfInstructGenerator:
    """Generate instruction-tuning data using the Self-Instruct method."""

    def __init__(self, seed_instructions: list[dict], model: str = "gpt-4"):
        self.seed_instructions = seed_instructions
        self.generated_instructions = []
        self.model = model
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_new_instruction(self) -> dict | None:
        """Generate a new instruction inspired by existing ones."""
        # Sample few-shot examples from the seeds plus recent generations
        pool = self.seed_instructions + self.generated_instructions[-100:]
        examples = random.sample(pool, min(8, len(pool)))

        # _build_generation_prompt and _parse_instruction are helpers not shown here
        prompt = self._build_generation_prompt(examples)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a creative assistant."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.8,  # Higher temp for diversity
        )
        instruction = self._parse_instruction(response.choices[0].message.content)

        # Quality filters
        if not self._passes_quality_check(instruction):
            return None

        # Remember accepted instructions so later calls can sample and deduplicate against them
        self.generated_instructions.append(instruction)
        return instruction

    def _passes_quality_check(self, instruction: dict) -> bool:
        """Filter low-quality or duplicate instructions."""
        text = instruction.get("instruction", "")

        # Length constraints
        num_words = len(text.split())
        if num_words < 5 or num_words > 100:
            return False

        # Diversity check (ROUGE-L against existing; _rouge_l is a helper not shown here)
        for existing in self.generated_instructions[-1000:]:
            if self._rouge_l(text, existing["instruction"]) > 0.7:
                return False

        return True

Research consistently shows that diversity and quality matter more than sheer volume. A curated set of 10,000 high-quality, diverse instructions often outperforms 1 million noisy examples. Invest in curation and filtering.
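The generator above relies on a `_rouge_l` helper for its diversity check but does not define it. A minimal sketch of such a score is given below, computed as the F1 of the longest common subsequence over lowercase whitespace tokens. The tokenization, the lack of stemming, and the helper names `lcs_length`, `rouge_l`, and `is_novel` are simplifying assumptions, not the exact metric used in the Self-Instruct paper's tooling.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (no stemming)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(text: str, existing: list[str], threshold: float = 0.7) -> bool:
    """True if text is not too similar to any existing instruction."""
    return all(rouge_l(text, other) <= threshold for other in existing)


print(rouge_l("Write a poem about autumn", "Write a poem about winter"))
print(is_novel("Explain how binary search works", ["Write a poem about winter"]))
```

The first pair shares four of five tokens in order, so it exceeds the 0.7 threshold and would be rejected as a near-duplicate; the second pair shares nothing and passes.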
Supervised Fine-Tuning (SFT) is the process of continuing training on instruction-response pairs. While conceptually simple, effective SFT requires careful attention to formatting, hyperparameters, and training dynamics.
Like pre-training, SFT uses cross-entropy loss on next-token prediction. The key difference is which tokens are included in the loss:
Conversation:
[System: You are a helpful assistant]
[User: What is 2+2?]
[Assistant: The answer is 4.]
Loss Masking:
[System: ...] → Loss = 0 (not predicting system prompt)
[User: ...] → Loss = 0 (not predicting user input)
[Assistant: The answer is 4.] → Loss computed here only
The model only learns to predict the assistant's tokens, not to reconstruct the prompt. This focuses learning on response generation.
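The masking scheme can be made concrete without any framework. In this sketch the token ids and per-token probabilities are made-up numbers; `-100` is used as the ignore value because that is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss


def build_labels(prompt_ids: list[int], response_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


def masked_nll(token_log_probs: list[float], labels: list[int]) -> float:
    """Mean negative log-likelihood over unmasked positions only.

    token_log_probs[i] is the log-probability the model assigned to the
    correct token at position i.
    """
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


prompt_ids = [11, 12, 13, 14]   # hypothetical ids for the system + user turns
response_ids = [21, 22, 23]     # hypothetical ids for the assistant reply
input_ids, labels = build_labels(prompt_ids, response_ids)

# Made-up probabilities of the correct token at each of the 7 positions.
log_probs = [math.log(p) for p in (0.1, 0.2, 0.1, 0.3, 0.5, 0.25, 0.8)]
loss = masked_nll(log_probs, labels)

print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
print(round(loss, 4))
```

Only the last three probabilities contribute to the loss; the poorly predicted prompt tokens are ignored. In Hugging Face causal language models the one-position shift between inputs and labels is applied inside the model, so labels are passed aligned with `input_ids` as shown here.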
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def prepare_sft_batch(examples, tokenizer, max_length=2048):
    """Prepare a batch for supervised fine-tuning with proper loss masking."""
    input_ids_list = []
    labels_list = []

    for example in examples:
        # Format the conversation
        conversation = format_conversation(
            system=example.get("system", "You are a helpful assistant."),
            user=example["instruction"],
            assistant=example["output"],
        )

        # Tokenize
        tokens = tokenizer.encode(conversation, add_special_tokens=True)

        # Find where assistant response starts (find_assistant_start is a helper not shown here)
        assistant_start = find_assistant_start(tokens, tokenizer)

        # Create labels: -100 for prompt tokens (ignored in loss)
        labels = [-100] * assistant_start + tokens[assistant_start:]

        # Truncate if necessary
        if len(tokens) > max_length:
            tokens = tokens[:max_length]
            labels = labels[:max_length]

        input_ids_list.append(tokens)
        labels_list.append(labels)

    # Pad to same length (pad_sequences is a helper not shown here)
    input_ids = torch.tensor(pad_sequences(input_ids_list, tokenizer.pad_token_id))
    labels = torch.tensor(pad_sequences(labels_list, -100))

    return {
        "input_ids": input_ids,
        "labels": labels,
        # Caveat: if pad_token_id equals eos_token_id, this also masks real EOS tokens
        "attention_mask": input_ids != tokenizer.pad_token_id,
    }


def format_conversation(system: str, user: str, assistant: str) -> str:
    """Format using ChatML or similar template."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>"
    )


# Typical SFT hyperparameters
sft_config = {
    "learning_rate": 2e-5,   # Lower than pre-training
    "batch_size": 128,       # Smaller batches sufficient
    "epochs": 3,             # Typically 2-4; few epochs to avoid overfitting
    "warmup_ratio": 0.03,    # Short warmup
    "weight_decay": 0.0,     # Often disabled for fine-tuning
    "lr_scheduler": "cosine",
}

| Hyperparameter | Typical Range | Reasoning |
|---|---|---|
| Learning Rate | 1e-6 to 5e-5 | Too high → catastrophic forgetting |
| Epochs | 1-4 | More epochs → overfitting, style collapse |
| Batch Size | 64-256 | Smaller batches often work well |
| Context Length | 2048-8192 | Match expected deployment context |
| Weight Decay | 0.0-0.01 | Less regularization vs pre-training |
| LoRA Rank | 8-128 | If using parameter-efficient fine-tuning |
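To see how the learning rate, warmup ratio, batch size, and epoch count interact, the sketch below computes the number of optimizer steps and the learning rate at a given step under linear warmup followed by cosine decay. The dataset size of 50,000 examples is a hypothetical figure, and `lr_at_step` is a simplified stand-in for a framework scheduler rather than any library's exact implementation.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer updates for a full run."""
    return math.ceil(num_examples / batch_size) * epochs


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_ratio: float = 0.03, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run: 50,000 examples, batch size 128, 3 epochs.
steps = total_optimizer_steps(num_examples=50_000, batch_size=128, epochs=3)
print(steps)
for s in (0, 34, steps // 2, steps - 1):
    print(s, f"{lr_at_step(s, steps):.2e}")
```

With so few total steps, a 3% warmup lasts only a few dozen updates, which is one reason short warmups are common in SFT.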
Common pitfalls:
Evaluating instruction-following models is challenging because "good response" is subjective and multidimensional. Multiple complementary approaches are necessary.
1. Benchmark Accuracy
Standard ML benchmarks measure capability:
| Benchmark | Measures | Format |
|---|---|---|
| MMLU | Knowledge across 57 subjects | Multiple choice |
| HellaSwag | Common-sense reasoning | Multiple choice |
| GSM8K | Math word problems | Free-form answer |
| HumanEval | Code generation | Functional correctness |
| TruthfulQA | Avoiding falsehoods | Multiple choice/free |
| MT-Bench | Multi-turn conversation | Free-form (LLM judged) |
2. LLM-as-Judge
Use a stronger model (GPT-4) to evaluate responses:
judge_prompt = """
Evaluate the following response to the given instruction.
Rate on a scale of 1-10 for:
- Helpfulness
- Accuracy
- Relevance
- Completeness
Instruction: {instruction}
Response: {response}
Provide ratings and brief justification.
"""
Advantages: Scalable, correlates with human judgment
Disadvantages: Bias toward verbose responses, judge model limitations
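A judge prompt is only useful if its output can be turned into numbers. The sketch below fills in the prompt and parses the scores from a reply. The requested "Criterion: score" reply format and the hard-coded sample reply are assumptions for illustration; no model is called.

```python
import re

JUDGE_PROMPT = """Evaluate the following response to the given instruction.
Rate on a scale of 1-10 for:
- Helpfulness
- Accuracy
- Relevance
- Completeness

Instruction: {instruction}
Response: {response}

Answer with one line per criterion in the form "Criterion: score",
followed by a brief justification."""

CRITERIA = ("Helpfulness", "Accuracy", "Relevance", "Completeness")


def build_judge_prompt(instruction: str, response: str) -> str:
    """Insert the instruction and candidate response into the judge prompt."""
    return JUDGE_PROMPT.format(instruction=instruction, response=response)


def parse_ratings(judge_output: str) -> dict[str, int]:
    """Extract 'Criterion: N' scores; criteria the judge omitted are skipped."""
    ratings = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*:\s*(\d{{1,2}})\b", judge_output, re.IGNORECASE)
        if match and 1 <= int(match.group(1)) <= 10:
            ratings[name.lower()] = int(match.group(1))
    return ratings


# Hypothetical judge reply, hard-coded here instead of calling a model.
sample_reply = """Helpfulness: 8
Accuracy: 9
Relevance: 9
Completeness: 7
The answer is correct but omits edge cases."""

scores = parse_ratings(sample_reply)
print(scores)
print(sum(scores.values()) / len(scores))
```

Because judges tend to favor longer answers, it can help to track response length alongside these scores when comparing models.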
3. Rule-Based Evaluation
For specific tasks, programmatic checks work well:
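import json


def evaluate_code_response(response: str, test_cases: list) -> dict:
    """Evaluate code generation by running tests."""
    if not test_cases:
        return {"pass_rate": 0.0, "error": "no test cases"}
    try:
        # WARNING: exec runs untrusted model output; use a sandbox in practice
        namespace = {}
        exec(response, namespace)  # Execute generated code into its own namespace
        # run_test is a helper not shown here; it is assumed to check one
        # test case against the definitions collected in namespace
        passed = sum(1 for tc in test_cases if run_test(tc, namespace))
        return {"pass_rate": passed / len(test_cases)}
    except Exception as e:
        return {"pass_rate": 0, "error": str(e)}


def evaluate_format_following(response: str, expected_format: str) -> bool:
    """Check if response follows JSON/XML/etc format."""
    if expected_format == "json":
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False
    # Other formats (XML, etc.) would need their own parsers
    raise ValueError(f"Unsupported format: {expected_format}")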
Human evaluation remains the gold standard but is expensive and slow:
Pairwise Comparison (Arena-style):
Likert Scale Rating:
Preference Ranking:
No single metric captures 'assistant quality.' Real-world usefulness depends on task distribution (which tasks do users actually ask?), interaction patterns (multi-turn, clarification-seeking), and subjective preferences (verbosity, tone, format). Deploy with monitoring and iterate based on real usage.
| Context | Primary Evaluation | Secondary Evaluation |
|---|---|---|
| Research/Ablation | Benchmark accuracy (MMLU, etc.) | LLM-as-judge on sample |
| Pre-deployment | MT-Bench, human eval on key scenarios | Safety evaluation suite |
| Production | User retention, thumbs up/down ratio | A/B test against baseline |
| Safety Review | Red-teaming, adversarial prompts | Automated toxicity scoring |
Beyond basic SFT, several techniques improve instruction-following quality and efficiency.
Full fine-tuning modifies all parameters. PEFT methods update only a small fraction:
LoRA (Low-Rank Adaptation)
Instead of updating weight matrix $W$, learn low-rank update: $$W' = W + BA$$
Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.
| Quantity | Full Fine-tuning | LoRA (r=8) | Reduction |
|---|---|---|---|
| Trainable parameters | 7B | ~17M | 99.7% |
| Training memory | 140 GB (7B in bf16 + optimizer) | ~4 GB | 97% |
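The savings follow directly from the shapes: a full update to a $d \times k$ matrix has $dk$ entries, while $B$ and $A$ together have $r(d + k)$. The sketch below works this out for a single square matrix and then applies $W' = W + BA$ on a tiny example. The size 4096 is an assumed, typical hidden dimension for a 7B-class model, and the small matrices are made-up values.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one d x k weight matrix."""
    full = d * k          # every entry of W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora


def matmul(X, Y):
    """Plain-Python matrix product, enough for a tiny demonstration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, f"{lora / full:.2%}")

# Tiny example with d=2, k=3, r=1
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[0.5], [-1.0]]        # d x r
A = [[1.0, 0.0, 2.0]]      # r x k
W_adapted = add(W, matmul(B, A))  # W' = W + BA
print(W_adapted)
```

In practice $B$ is usually initialized to zero so that $W' = W$ at the start of training, and only $A$ and $B$ receive gradients while $W$ stays frozen.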
QLoRA
Combines LoRA with 4-bit quantization:
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                 # Rank of update matrices
    lora_alpha=32,        # Scaling factor (often 2×r)
    target_modules=[      # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Create PEFT model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
# The exact numbers depend on the configuration. With r=16 on all seven
# projection types of Llama-2-7B, expect about 40M trainable parameters
# out of roughly 6.78B (about 0.59%).

# Training proceeds as normal, but only LoRA params are updated

NEFTune is a simple technique that improves SFT by adding noise to embeddings during training:
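import torch


def neftune_forward(embeddings, noise_alpha=5.0, training=True):
    """Add uniform noise to token embeddings during training (NEFTune)."""
    if training:
        # Scale noise by alpha / sqrt(seq_len * embed_dim), as in the NEFTune paper
        seq_len, embed_dim = embeddings.shape[-2], embeddings.shape[-1]
        mag_norm = noise_alpha / (seq_len * embed_dim) ** 0.5
        # Add uniform noise in [-mag_norm, mag_norm]
        noise = torch.zeros_like(embeddings).uniform_(-1, 1) * mag_norm
        embeddings = embeddings + noise
    return embeddings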
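This regularization reduces overfitting and improves generalization, especially on smaller datasets. The reported gains are mainly in judged response quality: the NEFTune paper reports AlpacaEval win-rate improvements of roughly 8-10 percentage points on several instruction datasets, and considerably more on Alpaca, with little change on standard knowledge benchmarks.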
Order training examples from simpler to more complex:
Phase 1: Simple Q&A, single-turn, short responses
Phase 2: Multi-step tasks, structured outputs
Phase 3: Complex reasoning, long-form generation
Phase 4: Multi-turn conversations, edge cases
Intuition: Establish basic patterns before tackling complexity.
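One way to realize this ordering is to assign each example a phase and sort. The heuristics below (turn count, response length, and the `needs_reasoning` and `structured_output` flags) are hypothetical proxies for difficulty chosen for illustration; a real pipeline would need its own difficulty signal.

```python
def difficulty_phase(example: dict) -> int:
    """Assign a curriculum phase (1-4) from crude, hypothetical heuristics."""
    turns = example.get("turns", 1)
    response_words = len(example["output"].split())
    if turns > 1:
        return 4  # multi-turn conversations, edge cases
    if example.get("needs_reasoning") or response_words > 300:
        return 3  # complex reasoning, long-form generation
    if example.get("structured_output") or response_words > 50:
        return 2  # multi-step tasks, structured outputs
    return 1      # simple single-turn Q&A with short responses


def curriculum_order(examples: list[dict]) -> list[dict]:
    """Stable sort by phase, so the original order within a phase is preserved."""
    return sorted(examples, key=difficulty_phase)


examples = [
    {"instruction": "Discuss the plan", "output": "Sure, let's begin.", "turns": 3},
    {"instruction": "What is 2+2?", "output": "4"},
    {"instruction": "Return the fields as JSON", "output": '{"a": 1}', "structured_output": True},
    {"instruction": "Prove the claim", "output": "Step 1 ...", "needs_reasoning": True},
]
ordered = curriculum_order(examples)
print([difficulty_phase(e) for e in ordered])  # [1, 2, 3, 4]
```

Whether a strict curriculum beats shuffled training is an empirical question for each dataset, so it is worth comparing both orderings on a held-out set.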
Training on diverse tasks jointly improves generalization:
mixture = {
    "general_qa": 0.25,
    "coding": 0.20,
    "reasoning": 0.15,
    "creative_writing": 0.15,
    "summarization": 0.10,
    "translation": 0.10,
    "safety_examples": 0.05,
}
# Sample each batch according to mixture weights
The mixture significantly affects model behavior. Oversampling coding produces a coding-focused model; oversampling safety produces a cautious one.
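Sampling a batch according to the mixture weights can be sketched with the standard library alone. The per-task datasets here are placeholder strings, and `sample_batch` is a hypothetical helper; a real loader would draw tokenized examples instead.

```python
import random
from collections import Counter

mixture = {
    "general_qa": 0.25,
    "coding": 0.20,
    "reasoning": 0.15,
    "creative_writing": 0.15,
    "summarization": 0.10,
    "translation": 0.10,
    "safety_examples": 0.05,
}


def sample_batch(datasets: dict[str, list], mixture: dict[str, float],
                 batch_size: int, rng: random.Random) -> list[tuple[str, str]]:
    """Draw a batch whose task proportions follow the mixture weights in expectation."""
    tasks = list(mixture)
    weights = [mixture[task] for task in tasks]
    chosen = rng.choices(tasks, weights=weights, k=batch_size)
    return [(task, rng.choice(datasets[task])) for task in chosen]


# Hypothetical placeholder datasets: ten dummy examples per task.
datasets = {task: [f"{task}_example_{i}" for i in range(10)] for task in mixture}

rng = random.Random(0)
batch = sample_batch(datasets, mixture, batch_size=1000, rng=rng)
counts = Counter(task for task, _ in batch)
print(counts.most_common())
```

Because the weights are decoupled from dataset sizes, a small but important source such as safety examples can be upweighted without collecting more data, at the cost of repeating its examples more often.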
Use PEFT (LoRA/QLoRA) for limited compute, experimentation, and preserving base capabilities. Use full fine-tuning for maximum quality or significant distribution shift, when compute is available. In practice, LoRA is often reported to reach roughly 90-95% of full fine-tuning quality at a fraction of the cost, though the gap depends on the task and the rank.
Let's trace the complete journey from a base language model to a production assistant.
Choose a pre-trained model based on:
| Criterion | Considerations |
|---|---|
| Size | Larger = more capable, more expensive to run |
| License | Apache/MIT for commercial, LLaMA license, proprietary |
| Pre-training data | Recency, multilinguality, code content |
| Architecture | Standard transformer? Efficient variants? |
| Ecosystem | Quantization support, inference frameworks |
Popular choices: LLaMA 2/3, Mistral, Qwen, Gemma, Phi
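from transformers import TrainingArguments
from trl import SFTTrainer

# Typical training run. Assumes model, tokenizer, train_dataset, and val_dataset
# are already defined. Argument names follow older transformers/trl releases;
# newer versions rename evaluation_strategy to eval_strategy and move
# dataset_text_field and max_seq_length into trl's SFTConfig.
training_args = TrainingArguments(
    output_dir="./sft-model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch of 32 sequences per device
    learning_rate=2e-5,
    num_train_epochs=3,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="steps",
    save_steps=200,
    evaluation_strategy="steps",
    eval_steps=200,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer.train()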
Evaluation checklist:
Common iteration patterns:
| Observation | Likely Fix |
|---|---|
| Refuses to answer basic questions | Add more diverse Q&A examples |
| Responses too verbose | Add concise examples, reduce verbosity in data |
| Poor code quality | Add more code examples, higher quality code data |
| Format inconsistent | Ensure consistent template, add format examples |
| Hallucinating facts | Can't fix with SFT alone—need RLHF/verification |
| Repetitive outputs | Lower learning rate, fewer epochs, add diversity |
Quantization for efficiency:
Merge LoRA if applicable:
model = model.merge_and_unload() # Bake adapters into base
Convert for inference framework:
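# Convert to GGUF for llama.cpp
# (script and binary names vary by llama.cpp version; recent releases ship
# convert_hf_to_gguf.py and llama-quantize instead of convert.py and quantize)
python convert.py ./sft-model --outtype f16 --outfile model.gguf
# Quantize to 4-bit
./quantize model.gguf model-q4_k_m.gguf q4_k_m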
SFT teaches format and basic alignment, but has limitations: it can't reliably reduce hallucinations, optimize for human preferences, or handle safety robustly. Production assistants typically require additional alignment through RLHF (covered next) and ongoing monitoring.
Instruction tuning is the crucial step that transforms powerful but unfocused base models into useful assistants. Through supervised learning on instruction-response pairs, we unlock the latent capabilities that pre-training established.
What's next:
While SFT teaches models to follow instructions, it optimizes for matching training outputs rather than genuine helpfulness. The next page covers RLHF (Reinforcement Learning from Human Feedback)—a technique that directly optimizes models for human preferences, addressing limitations of pure supervised learning and enabling more nuanced alignment.
You now understand instruction tuning—the process that creates useful AI assistants from raw language models. This knowledge enables you to create, evaluate, and deploy custom instruction-following models. Next, we explore RLHF for deeper alignment.