Large Language Models - Learning Module


Instruction Tuning

From Language Model to Assistant

A base language model trained purely on next-token prediction is a powerful engine, but it behaves like an autocomplete system: give it "The capital of France is" and it will output "Paris", but ask it "What is the capital of France?" and it may continue with "A question for today's geography lesson..." rather than answering.

This is the instruction-following gap. Base models learn to predict text, but not to follow instructions. They lack the alignment between user intent and model behavior that makes an LLM useful as an assistant, coding tool, or reasoning engine.

Instruction tuning bridges this gap. Through supervised fine-tuning on instruction-response pairs, we teach models to recognize instructions and respond appropriately. This seemingly simple step (showing the model examples of helpful responses) produces much of the dramatic behavioral shift that turns a raw language model into an assistant such as ChatGPT.

What You Will Learn

This page covers the complete instruction tuning pipeline: the motivation and mechanics, instruction dataset creation, supervised fine-tuning techniques, and the challenges of creating genuinely helpful instruction-following models. You will understand how to transform any capable base model into an effective assistant.

The Instruction-Following Problem

Base language models are trained to model the distribution of text on the internet. This creates a fundamental misalignment with how we want to use them:

What Base Models Learn

Given the prompt "Write a poem about autumn", a base model predicts what typically follows such text in its training data. This might be:

An actual poem (if it was trained on poetry books)
An essay about poetry (if trained on writing guides)
Instructions on how to write poetry (if trained on educational content)
A continuation of a conversation ("Sure, here's a poem...")
Completely unrelated text that happens to be statistically likely

The base model doesn't "understand" that you want a poem—it predicts tokens. Sometimes prediction and instruction-following coincide; often they don't.

Base Model Behaviors

•Continues text in any direction
•May ignore or misinterpret instructions
•Produces inconsistent formats
•No concept of task completion
•Mixes helpful and harmful content
•No awareness of being an assistant

Instruction-Tuned Behaviors

•Responds directly to instructions
•Understands user intent from context
•Produces structured, expected formats
•Knows when to stop responding
•Tuned toward helpful and safe responses
•Consistent assistant persona

The Surprising Efficiency of Instruction Tuning

Given the dramatic behavioral change, one might expect instruction tuning to require retraining from scratch. Remarkably, the opposite is true:

Training Phase	Compute (GPT-3 scale)	Tokens	Effect
Pre-training	~$5M (~3×10²³ FLOPs)	300B	Acquire knowledge and capabilities
Instruction Tuning	~$100K (<1% of pre-training)	10-100M	Align behavior to instructions
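
The compute figures in the table can be sanity-checked with the common rule of thumb that training costs about 6 FLOPs per parameter per token. The sketch below assumes a 175B-parameter model (GPT-3's size) and the upper end of the token range above; the function name is illustrative and the rule is only an approximation.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

N_PARAMS = 175e9  # assumed model size (GPT-3 scale)

pretrain_flops = training_flops(N_PARAMS, 300e9)  # 300B pre-training tokens
sft_flops = training_flops(N_PARAMS, 100e6)       # 100M instruction-tuning tokens

ratio = sft_flops / pretrain_flops
print(f"pre-training: {pretrain_flops:.2e} FLOPs")
print(f"instruction tuning: {sft_flops:.2e} FLOPs ({ratio:.4%} of pre-training)")
```

Under these assumptions the raw compute for instruction tuning is well below 1% of pre-training, consistent with the table. Dollar cost does not scale quite as cleanly, since data collection and repeated experiments add overhead.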

Key insight: Instruction tuning doesn't teach new knowledge—it teaches the model to access and apply knowledge it already has. The capability was latent in the base model; instruction tuning unlocks it.

This efficiency has profound implications:

Accessibility: Anyone with a base model can create their own instruction-tuned variant
Customization: Different instruction sets produce different assistant behaviors
Iteration: Rapid experimentation with alignment approaches becomes feasible
Specialization: Domain-specific instruction tuning creates expert assistants

The Superficial Alignment Hypothesis

Current evidence suggests that instruction tuning primarily teaches the model a format and style—the 'surface' of being helpful—rather than fundamentally changing its underlying capabilities. The base model's knowledge and reasoning ability largely determine the ceiling; instruction tuning teaches the model to consistently reach that ceiling.

Designing Instruction Datasets

The quality and diversity of instruction data largely determine the resulting model's capabilities. Creating effective instruction datasets is both art and science.

Anatomy of an Instruction Example

Instruction examples typically follow a structured format:

{
    "instruction": "Write a Python function that calculates factorial",
    "input": "(optional) Additional context or parameters",
    "output": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
}

The model learns to map (instruction, input) pairs to appropriate outputs. During inference, users provide the instruction/input, and the model generates the output.
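
To make the mapping concrete, here is a minimal sketch of how one record might be rendered into a prompt and a full training string. The `### Instruction` / `### Response` layout is a simplified Alpaca-style template chosen for illustration, and the function names are hypothetical; real projects should use whatever template their model expects.

```python
def build_prompt(example: dict) -> str:
    """Render the part the user supplies (instruction plus optional input)."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):  # the input field is optional
        prompt += f"### Input:\n{example['input']}\n\n"
    return prompt + "### Response:\n"

def build_training_text(example: dict) -> str:
    """Prompt followed by the target output the model should learn to produce."""
    return build_prompt(example) + example["output"]

example = {
    "instruction": "Translate the sentence to French",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt = build_prompt(example)
full_text = build_training_text(example)
print(full_text)
```

At inference time only `build_prompt` is used, and the model generates what would have been the `output` field.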

Instruction Dataset Categories
Category	Examples	Purpose
Open-ended Generation	Write a story, compose a poem, brainstorm ideas	Creative capability, fluent generation
Closed QA	What is the capital of France?	Factual recall, concise answering
Open QA	Explain why the sky is blue	Explanation, teaching, reasoning shown
Task Completion	Summarize this text, translate this, extract entities	Following specific procedures
Conversation	Multi-turn dialogues with context	Coherent multi-turn interaction
Math/Reasoning	Solve 2x + 5 = 15, logic puzzles	Step-by-step reasoning
Coding	Write/debug/explain code	Programming capability
Refusal	How to harm someone → I can't help with that	Safety alignment

Data Collection Strategies

1. Human-Written Data

The highest quality but most expensive approach. Human annotators create both instructions and responses.

Pros: High quality, nuanced, authentic human intent
Cons: Expensive ($10-50 per high-quality example), slow to scale

2. Template-Based Generation

Define templates and fill with variations:

Templates:
- "Explain {concept} as if I were {age_group}"
- "Compare and contrast {A} and {B}"
- "Write a {length} {genre} story about {topic}"

Pros: Scalable, controllable diversity
Cons: May lack naturalness, template artifacts
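
Template filling can be sketched in a few lines: take every combination of slot values and substitute it into the template. The slot values and the `expand_template` helper below are illustrative assumptions, not part of any particular dataset toolkit.

```python
import itertools
import string

def expand_template(template: str, slots: dict) -> list:
    """Fill a template with every combination of the provided slot values."""
    # Extract placeholder names such as "concept" from "{concept}"
    fields = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    combos = itertools.product(*(slots[field] for field in fields))
    return [template.format(**dict(zip(fields, combo))) for combo in combos]

slots = {
    "concept": ["recursion", "gradient descent"],
    "age_group": ["a ten-year-old", "a college student"],
}
instructions = expand_template("Explain {concept} as if I were {age_group}", slots)
for text in instructions:
    print(text)
```

The number of generated instructions grows multiplicatively with the slot lists, which is where both the scalability and the risk of repetitive, template-shaped data come from.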

3. Model-Generated Data (Self-Instruct)

Use an existing LLM to generate instruction-response pairs:

Seed with small set of human examples
Prompt model to generate new instructions
Filter for quality and diversity
Generate responses to new instructions
Human verification (optional)

Pros: Highly scalable, leverages model capability
Cons: May amplify model biases, quality ceiling limited by generator

4. Distillation from Stronger Models

Collect outputs from a stronger model such as GPT-4 or Claude and use them to train a smaller model.

Pros: High-quality outputs, scalable
Cons: Legal concerns (possible terms-of-service violations), capability limited by the source model

Self-Instruct Data Generation
Python
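import random

from openai import OpenAI  # Requires openai>=1.0


class SelfInstructGenerator:
    """
    Generate instruction-tuning data using the Self-Instruct method.
    The helper methods are minimal illustrations, not a reference implementation.
    """

    def __init__(self, seed_instructions: list[dict], model: str = "gpt-4"):
        self.seed_instructions = seed_instructions
        self.generated_instructions = []
        self.model = model
        self.client = OpenAI()  # Reads OPENAI_API_KEY from the environment

    def generate_new_instruction(self) -> dict | None:
        """
        Generate a new instruction inspired by existing ones.
        Returns None if the candidate fails the quality filters.
        """
        # Sample few-shot examples from seeds plus recent generations
        pool = self.seed_instructions + self.generated_instructions[-100:]
        examples = random.sample(pool, min(8, len(pool)))

        prompt = self._build_generation_prompt(examples)

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a creative assistant."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.8,  # Higher temp for diversity
        )

        instruction = self._parse_instruction(response)

        # Quality filters
        if not self._passes_quality_check(instruction):
            return None

        # Keep accepted instructions so later sampling and filtering see them
        self.generated_instructions.append(instruction)
        return instruction

    def _build_generation_prompt(self, examples: list[dict]) -> str:
        """List the sampled instructions and ask for one new task."""
        lines = ["Here are some example task instructions:"]
        lines += [f"{i}. {ex['instruction']}" for i, ex in enumerate(examples, 1)]
        lines.append("Write one new, different task instruction. "
                     "Reply with the instruction text only.")
        return "\n".join(lines)

    def _parse_instruction(self, response) -> dict:
        """Wrap the model's reply in the dataset's record format."""
        text = response.choices[0].message.content or ""
        return {"instruction": text.strip()}

    def _passes_quality_check(self, instruction: dict) -> bool:
        """
        Filter low-quality or duplicate instructions.
        """
        text = instruction.get("instruction", "")

        # Length constraints
        n_words = len(text.split())
        if n_words < 5 or n_words > 100:
            return False

        # Diversity check (ROUGE-L against seeds and recent generations)
        for existing in self.seed_instructions + self.generated_instructions[-1000:]:
            if self._rouge_l(text, existing["instruction"]) > 0.7:
                return False

        return True

    @staticmethod
    def _rouge_l(a: str, b: str) -> float:
        """ROUGE-L F1 based on the longest common subsequence of words."""
        x, y = a.lower().split(), b.lower().split()
        if not x or not y:
            return 0.0
        # Dynamic-programming LCS length, one row at a time
        prev = [0] * (len(y) + 1)
        for xi in x:
            curr = [0]
            for j, yj in enumerate(y, 1):
                curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
            prev = curr
        lcs = prev[-1]
        if lcs == 0:
            return 0.0
        precision, recall = lcs / len(x), lcs / len(y)
        return 2 * precision * recall / (precision + recall)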

Quality Over Quantity

Several studies suggest that diversity and quality matter more than sheer volume: on the order of 10,000 high-quality, diverse instructions can outperform 1 million noisy examples. Invest in curation and filtering.

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is the process of continuing training on instruction-response pairs. While conceptually simple, effective SFT requires careful attention to formatting, hyperparameters, and training dynamics.

The SFT Training Objective

Like pre-training, SFT uses cross-entropy loss on next-token prediction. The key difference is which tokens are included in the loss:

Conversation:
    [System: You are a helpful assistant]
    [User: What is 2+2?]
    [Assistant: The answer is 4.]

Loss Masking:
    [System: ...] → Loss = 0 (not predicting system prompt)
    [User: ...] → Loss = 0 (not predicting user input)
    [Assistant: The answer is 4.] → Loss computed here only

The model only learns to predict the assistant's tokens, not to reconstruct the prompt. This focuses learning on response generation.
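
The effect of masking on the loss can be shown with plain arithmetic. The sketch below assumes we already know the probability the model assigned to each correct token; the token ids and probabilities are made up for illustration, and the usual one-position shift between inputs and labels is omitted for simplicity.

```python
import math

IGNORE_INDEX = -100  # conventional "skip this position" label value

def masked_cross_entropy(correct_token_probs, labels):
    """Average negative log-likelihood over positions whose label is not masked."""
    losses = [
        -math.log(p)
        for p, label in zip(correct_token_probs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)

# Three prompt positions (masked) followed by four assistant positions.
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 464, 3280, 318, 604]
probs = [0.1, 0.2, 0.3, 0.5, 0.5, 0.5, 0.5]

loss = masked_cross_entropy(probs, labels)
print(f"masked loss: {loss:.4f}")
```

The low probabilities on the prompt positions have no influence on the result: only the four assistant positions contribute, so the loss is the average of their negative log-probabilities.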

SFT Training Implementation
Python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

IGNORE_INDEX = -100  # Label value ignored by PyTorch's cross-entropy loss


def format_prompt(system: str, user: str) -> str:
    """
    Everything up to and including the assistant header (ChatML-style).
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def format_conversation(system: str, user: str, assistant: str) -> str:
    """
    Format using ChatML or similar template.
    """
    return format_prompt(system, user) + f"{assistant}<|im_end|>"


def pad_sequences(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


def prepare_sft_batch(examples, tokenizer, max_length=2048):
    """
    Prepare a batch for supervised fine-tuning with proper loss masking.
    """
    input_ids_list, labels_list, mask_list = [], [], []

    for example in examples:
        system = example.get("system", "You are a helpful assistant.")
        prompt = format_prompt(system, example["instruction"])
        response = f"{example['output']}<|im_end|>"

        # Tokenize prompt and response separately so the boundary is exact.
        # Special tokens (e.g. BOS) are added once, at the start of the prompt.
        prompt_ids = tokenizer.encode(prompt, add_special_tokens=True)
        response_ids = tokenizer.encode(response, add_special_tokens=False)
        tokens = prompt_ids + response_ids

        # Create labels: -100 for prompt tokens (ignored in loss)
        labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids

        # Truncate if necessary
        tokens = tokens[:max_length]
        labels = labels[:max_length]

        input_ids_list.append(tokens)
        labels_list.append(labels)
        mask_list.append([1] * len(tokens))

    # Pad to same length; padded positions get label -100 and mask 0
    # (assumes tokenizer.pad_token_id is set)
    return {
        "input_ids": torch.tensor(pad_sequences(input_ids_list, tokenizer.pad_token_id)),
        "labels": torch.tensor(pad_sequences(labels_list, IGNORE_INDEX)),
        "attention_mask": torch.tensor(pad_sequences(mask_list, 0)),
    }


# Typical SFT hyperparameters
sft_config = {
    "learning_rate": 2e-5,        # Lower than pre-training
    "batch_size": 128,            # Smaller batches sufficient
    "epochs": 3,                  # Typically 2-4; few epochs to avoid overfitting
    "warmup_ratio": 0.03,         # Short warmup
    "weight_decay": 0.0,          # Often disabled for fine-tuning
    "lr_scheduler": "cosine",
}

SFT Hyperparameters and Best Practices

Hyperparameter	Typical Range	Reasoning
Learning Rate	1e-6 to 5e-5	Too high → catastrophic forgetting
Epochs	1-4	More epochs → overfitting, style collapse
Batch Size	64-256	Smaller batches often work well
Context Length	2048-8192	Match expected deployment context
Weight Decay	0.0-0.01	Less regularization vs pre-training
LoRA Rank	8-128	If using parameter-efficient fine-tuning
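
To see how a peak learning rate, a short warmup, and cosine decay interact, the schedule can be written out directly. This is a minimal sketch with illustrative defaults (peak 2e-5, 3% warmup); the exact formula used by a given training library may differ slightly.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

total = 1000
for step in (0, 15, 30, 500, 999):
    print(f"step {step:4d}: lr = {lr_at_step(step, total):.2e}")
```

With 1,000 total steps the rate climbs for the first 30 steps, peaks at 2e-5, and then decays smoothly, which keeps early updates small while the model adapts to the new format.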

Common pitfalls:

Catastrophic Forgetting: Too many epochs or high LR erases pre-trained knowledge
Style Collapse: Model converges to formulaic responses
Data Leakage: Training data appears in benchmarks (invalidating evaluation)
Distribution Mismatch: Training on one format, deploying with another
Overfitting: Model memorizes training examples rather than generalizing

SFT Best Practices

•Start with lower learning rates — Begin at 1e-5, adjust based on loss curves. Easier to increase than recover from too-high.
•Monitor validation loss closely — Overfitting happens fast. Stop when validation loss plateaus or increases.
•Use chat template consistently — The exact format (ChatML, Alpaca, Vicuna) matters. Use the same format at training and inference.
•Balance instruction types — Ensure diversity across categories. Overrepresentation creates bias.
•Include multi-turn examples — Real conversations are multi-turn. Train on them.
•Evaluate on held-out instructions — Reserve 1-5% of instructions for evaluation, never train on these.

Evaluating Instruction-Following Models

Evaluating instruction-following models is challenging because "good response" is subjective and multidimensional. Multiple complementary approaches are necessary.

Automatic Evaluation

1. Benchmark Accuracy

Standard ML benchmarks measure capability:

Benchmark	Measures	Format
MMLU	Knowledge across 57 subjects	Multiple choice
HellaSwag	Common-sense reasoning	Multiple choice
GSM8K	Math word problems	Free-form answer
HumanEval	Code generation	Functional correctness
TruthfulQA	Avoiding falsehoods	Multiple choice/free
MT-Bench	Multi-turn conversation	Free-form (LLM judged)

2. LLM-as-Judge

Use a stronger model (such as GPT-4) to evaluate responses:

judge_prompt = """
Evaluate the following response to the given instruction.
Rate on a scale of 1-10 for:
- Helpfulness
- Accuracy
- Relevance
- Completeness

Instruction: {instruction}
Response: {response}

Provide ratings and brief justification.
"""

Advantages: Scalable, correlates with human judgment
Disadvantages: Bias toward verbose responses, judge model limitations
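
Judge outputs are free text, so the ratings have to be parsed before they can be aggregated. The sketch below assumes the judge writes lines such as "Helpfulness: 8"; that output format, the sample text, and the helper name are assumptions for illustration, and a production setup would more likely request structured output.

```python
import re
from statistics import mean

DIMENSIONS = ("Helpfulness", "Accuracy", "Relevance", "Completeness")

def parse_judge_ratings(judge_output: str) -> dict:
    """Extract 1-10 ratings for each dimension from the judge's free-text reply."""
    ratings = {}
    for dim in DIMENSIONS:
        match = re.search(rf"{dim}\s*[:=]\s*(\d+)", judge_output, flags=re.IGNORECASE)
        if match and 1 <= int(match.group(1)) <= 10:
            ratings[dim] = int(match.group(1))
    return ratings

sample_output = (
    "Helpfulness: 8\nAccuracy: 9\nRelevance: 10\nCompleteness: 7\n"
    "Justification: correct and clearly explained, but omits edge cases."
)
ratings = parse_judge_ratings(sample_output)
overall = mean(ratings.values())
print(ratings, overall)
```

Missing or out-of-range scores are simply dropped here; in practice it is worth logging such cases, since a judge that frequently breaks format is itself a signal about reliability.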

3. Rule-Based Evaluation

For specific tasks, programmatic checks work well:

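import json

def evaluate_code_response(response: str, test_cases: list) -> dict:
    """Evaluate code generation by running tests.

    Assumes each test case is a (function_name, args, expected) tuple.
    exec() runs untrusted code, so use a sandbox in practice.
    """
    namespace = {}
    try:
        exec(response, namespace)  # Define the generated functions
    except Exception as e:
        return {"pass_rate": 0.0, "error": str(e)}

    passed = 0
    for func_name, args, expected in test_cases:
        try:
            if namespace[func_name](*args) == expected:
                passed += 1
        except Exception:
            pass  # A crashing or missing function counts as a failure
    return {"pass_rate": passed / len(test_cases) if test_cases else 0.0}

def evaluate_format_following(response: str, expected_format: str) -> bool:
    """Check if response follows the expected format (only JSON is handled here)."""
    if expected_format != "json":
        raise ValueError(f"Unsupported format: {expected_format}")
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False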

Human Evaluation

Human evaluation remains the gold standard but is expensive and slow:

Pairwise Comparison (Arena-style):

Show evaluators responses from two models (anonymized)
Ask which response is better
Aggregate into Elo ratings
Used by Chatbot Arena, produces reliable rankings
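
The aggregation step can be illustrated with the classic Elo update. This is a minimal sketch assuming a starting rating of 1000 and a K-factor of 32 (both arbitrary choices here); it is not a description of how any particular leaderboard computes its scores.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner, more for surprising results."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

Because each update moves the same number of points from one model to the other, the total rating is conserved, and an upset win against a higher-rated model moves ratings more than an expected win.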

Likert Scale Rating:

Rate individual responses 1-5 on dimensions
More granular but less reliable (calibration varies)

Preference Ranking:

Rank 3+ responses best to worst
Useful for training preference models (RLHF)

The Evaluation Gap

No single metric captures 'assistant quality.' Real-world usefulness depends on task distribution (which tasks do users actually ask?), interaction patterns (multi-turn, clarification-seeking), and subjective preferences (verbosity, tone, format). Deploy with monitoring and iterate based on real usage.

Evaluation Strategy by Context
Context	Primary Evaluation	Secondary Evaluation
Research/Ablation	Benchmark accuracy (MMLU, etc.)	LLM-as-judge on sample
Pre-deployment	MT-Bench, human eval on key scenarios	Safety evaluation suite
Production	User retention, thumbs up/down ratio	A/B test against baseline
Safety Review	Red-teaming, adversarial prompts	Automated toxicity scoring

Advanced Instruction Tuning Techniques

Beyond basic SFT, several techniques improve instruction-following quality and efficiency.

Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning modifies all parameters. PEFT methods update only a small fraction:

LoRA (Low-Rank Adaptation)

Instead of updating weight matrix $W$, learn low-rank update: $$W' = W + BA$$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

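Quantity	Full Fine-tuning	LoRA (r=8)	Reduction
Trainable parameters	7B	~17M	~99.8%
Training memory (excluding activations)	~140 GB (bf16 weights + gradients + optimizer state)	~15 GB (frozen bf16 weights + adapter gradients and optimizer state)	~89%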
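
The parameter savings follow directly from the shapes of $B$ and $A$: a full update to a $d \times k$ matrix has $dk$ entries, while the low-rank update has $r(d + k)$. The sketch below works this out for a single square 4096×4096 projection, a size chosen purely for illustration.

```python
def full_update_params(d: int, k: int) -> int:
    """Entries in a dense update to a d x k weight matrix."""
    return d * k

def lora_update_params(d: int, k: int, r: int) -> int:
    """Entries in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096  # assumed projection size
full = full_update_params(d, k)
for r in (8, 16, 64):
    lora = lora_update_params(d, k, r)
    print(f"r={r:3d}: {lora:,} LoRA params vs {full:,} full ({lora / full:.2%})")
```

The trainable count grows linearly with the rank, so doubling $r$ doubles adapter size while still staying at a small fraction of the dense matrix.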

QLoRA

Combines LoRA with 4-bit quantization:

Base model quantized to 4-bit (~0.5 bytes/param)
LoRA adapters kept in higher precision (e.g., bf16)
Enables fine-tuning 65B models on a single 48GB GPU
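
A back-of-the-envelope calculation shows why 4-bit storage makes the difference. The sketch counts weight memory only (4 bits is 0.5 bytes per parameter) and ignores activations, adapter state, and quantization metadata, so real usage is somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 65e9  # 65B-parameter model

bf16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f" 4-bit weights: {nf4_gb:.1f} GB")
print(f"fits in 48 GB with room for adapters: {nf4_gb < 48}")
```

At 16 bits the weights alone exceed a 48 GB card several times over, whereas at 4 bits they leave headroom for the adapters and activations.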

LoRA Configuration
Python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank of update matrices
    lora_alpha=32,                 # Scaling factor (often 2×r)
    target_modules=[               # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Create PEFT model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
# With r=16 on all seven projection types of this 7B model, this reports
# roughly 40M trainable parameters, about 0.6% of the total.

# Training proceeds as normal, but only LoRA params are updated

NEFTune: Noise Embeddings for Fine-Tuning

A simple technique that improves SFT by adding noise to embeddings:

import torch

def neftune_forward(embeddings, noise_alpha=5.0, training=True):
    """Add noise to embeddings of shape (batch, seq_len, dim) during training."""
    if training:
        # Compute noise scaling: alpha / sqrt(seq_len * dim)
        seq_len, dim = embeddings.shape[-2], embeddings.shape[-1]
        mag_norm = noise_alpha / (seq_len * dim) ** 0.5

        # Add uniform noise in [-mag_norm, mag_norm]
        noise = torch.zeros_like(embeddings).uniform_(-1, 1) * mag_norm
        embeddings = embeddings + noise

    return embeddings

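This regularization reduces overfitting to the fine-tuning data and improves generalization, especially on smaller datasets. The NEFTune paper reported large gains in conversational quality (AlpacaEval win rate), while scores on standard knowledge and reasoning benchmarks stayed roughly unchanged.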

Curriculum Learning

Order training examples from simpler to more complex:

Phase 1: Simple Q&A, single-turn, short responses
Phase 2: Multi-step tasks, structured outputs
Phase 3: Complex reasoning, long-form generation
Phase 4: Multi-turn conversations, edge cases

Intuition: Establish basic patterns before tackling complexity.
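
One simple way to build such phases is to sort examples by a difficulty proxy and cut the sorted list into consecutive chunks. The proxy below (number of turns, then response length) and the `num_turns` field are hypothetical stand-ins; real curricula may use model loss, task labels, or human judgments instead.

```python
def difficulty(example: dict) -> tuple:
    """Hypothetical proxy: more turns first, then longer outputs, count as harder."""
    return (example.get("num_turns", 1), len(example["output"].split()))

def curriculum_phases(examples: list, n_phases: int = 4) -> list:
    """Sort by difficulty and split into up to n_phases consecutive chunks."""
    ordered = sorted(examples, key=difficulty)
    size = -(-len(ordered) // n_phases)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

specs = [(50, 1), (5, 1), (200, 1), (20, 3), (10, 1), (100, 2), (30, 1), (8, 2)]
examples = [{"output": "word " * n, "num_turns": t} for n, t in specs]

phases = curriculum_phases(examples)
for i, phase in enumerate(phases, 1):
    print(f"Phase {i}: {[difficulty(ex) for ex in phase]}")
```

Training then iterates over the phases in order. Whether this helps over plain shuffling is an empirical question for a given dataset.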

Multi-Task Instruction Tuning

Training on diverse tasks jointly improves generalization:

mixture = {
    "general_qa": 0.25,
    "coding": 0.20,
    "reasoning": 0.15,
    "creative_writing": 0.15,
    "summarization": 0.10,
    "translation": 0.10,
    "safety_examples": 0.05,
}

# Sample each batch according to mixture weights

The mixture significantly affects model behavior. Oversampling coding produces a coding-focused model; oversampling safety produces a cautious one.
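
Sampling according to the mixture can be sketched with weighted random choice over dataset names. The weights repeat the mixture above; the helper name and the fixed seed are illustrative, and a real pipeline would then draw an actual example from each chosen dataset.

```python
import random
from collections import Counter

mixture = {
    "general_qa": 0.25,
    "coding": 0.20,
    "reasoning": 0.15,
    "creative_writing": 0.15,
    "summarization": 0.10,
    "translation": 0.10,
    "safety_examples": 0.05,
}

def sample_batch_sources(mixture: dict, batch_size: int, rng: random.Random) -> list:
    """Pick which dataset each example in the batch is drawn from."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)

rng = random.Random(0)  # fixed seed for reproducibility
counts = Counter(sample_batch_sources(mixture, 10_000, rng))
for name, count in counts.most_common():
    print(f"{name:18s} {count / 10_000:.3f}")
```

Over many batches the observed proportions approach the mixture weights, so changing a weight is a direct lever on how often the model sees that task type.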

When to Use Full Fine-Tuning vs. PEFT

Use PEFT (LoRA/QLoRA) for: limited compute, experimentation, preserving base capabilities. Use full fine-tuning for: maximum quality, significant distribution shift, when compute is available. In practice, LoRA is often reported to reach roughly 90-95% of full fine-tuning quality at a fraction of the cost, though the gap varies with task, rank, and data.

The Complete Pipeline: Base Model to Assistant

Let's trace the complete journey from a base language model to a production assistant.

Stage 1: Base Model Selection

Choose a pre-trained model based on:

Criterion	Considerations
Size	Larger = more capable, more expensive to run
License	Apache/MIT for commercial, LLaMA license, proprietary
Pre-training data	Recency, multilinguality, code content
Architecture	Standard transformer? Efficient variants?
Ecosystem	Quantization support, inference frameworks

Popular choices: LLaMA 2/3, Mistral, Qwen, Gemma, Phi

Stage 2: Data Preparation

Collect/generate instructions (10K-1M examples)
Format consistently (ChatML, Alpaca, custom)
Quality filter (length, coherence, safety)
Deduplicate (avoid memorization)
Create train/val splits (reserve diverse held-out set)
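
The deduplication and splitting steps above can be sketched as follows. This assumes exact-match deduplication on a normalized instruction string (lowercased, whitespace collapsed) and a random 5% validation split; the helper names are illustrative, and near-duplicate detection would need a fuzzier method.

```python
import hashlib
import random

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(examples: list) -> list:
    """Keep the first example for each distinct normalized instruction."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex["instruction"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def train_val_split(examples: list, val_fraction: float = 0.05, seed: int = 0):
    """Shuffle a copy and hold out a fraction for validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

data = [{"instruction": f"Task number {i}", "output": "..."} for i in range(100)]
data.append({"instruction": "task   NUMBER 3", "output": "..."})  # trivial duplicate

unique = deduplicate(data)
train, val = train_val_split(unique)
print(len(data), len(unique), len(train), len(val))
```

Deduplicating before splitting matters: otherwise a duplicate can land on both sides of the split and make validation loss look better than it should.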

Stage 3: Supervised Fine-Tuning

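from transformers import TrainingArguments
from trl import SFTTrainer

# Typical training run
# Assumes `model`, `tokenizer`, `train_dataset`, and `val_dataset` are already defined.
# Argument names follow older transformers/TRL releases; newer releases rename some
# of them (for example, evaluation_strategy became eval_strategy).
training_args = TrainingArguments(
    output_dir="./sft-model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="steps",
    save_steps=200,
    evaluation_strategy="steps",
    eval_steps=200,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer.train()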
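
It helps to work out what these settings imply before launching a run. The sketch below assumes 4 GPUs and 52,000 training examples (both hypothetical) and uses the batch size, accumulation, epoch, and warmup values from the configuration above; exact step counts in a real trainer may differ slightly because of how partial batches are handled.

```python
import math

def training_schedule(num_examples: int, per_device_batch: int, grad_accum: int,
                      num_gpus: int, epochs: int, warmup_ratio: float) -> dict:
    """Derive effective batch size and approximate optimizer step counts."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

plan = training_schedule(
    num_examples=52_000,   # hypothetical dataset size
    per_device_batch=4,
    grad_accum=8,
    num_gpus=4,            # hypothetical hardware
    epochs=3,
    warmup_ratio=0.03,
)
print(plan)
```

Under these assumptions the effective batch size is 128 and the whole run is on the order of a thousand optimizer steps, so evaluating every 200 steps gives only a handful of checkpoints to compare.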

Stage 4: Evaluation and Iteration

Evaluation checklist:

Validation loss converged, not overfitting
MT-Bench score improved vs. base model
Task-specific benchmarks tested (code, math, etc.)
Manual spot-checking across instruction types
Safety evaluation passed
Format consistency verified

Common iteration patterns:

Observation	Likely Fix
Refuses to answer basic questions	Add more diverse Q&A examples
Responses too verbose	Add concise examples, reduce verbosity in data
Poor code quality	Add more code examples, higher quality code data
Format inconsistent	Ensure consistent template, add format examples
Hallucinating facts	Can't fix with SFT alone—need RLHF/verification
Repetitive outputs	Lower learning rate, fewer epochs, add diversity

Stage 5: Deployment Preparation

Quantization for efficiency:

GPTQ/AWQ for 4-bit deployment
Reduces weight memory roughly 4× relative to 16-bit weights while maintaining most accuracy

Merge LoRA if applicable:

model = model.merge_and_unload()  # Bake adapters into base

Convert for inference framework:

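# Convert to GGUF for llama.cpp
# (script and binary names vary across llama.cpp versions; newer releases
# use convert_hf_to_gguf.py and llama-quantize)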
python convert.py ./sft-model --outtype f16 --outfile model.gguf

# Quantize to 4-bit
./quantize model.gguf model-q4_k_m.gguf q4_k_m

SFT Is Necessary But Not Sufficient

SFT teaches format and basic alignment, but has limitations: it can't reliably reduce hallucinations, optimize for human preferences, or handle safety robustly. Production assistants typically require additional alignment through RLHF (covered next) and ongoing monitoring.

Summary: Teaching Models to Follow Instructions

Instruction tuning is the crucial step that transforms powerful but unfocused base models into useful assistants. Through supervised learning on instruction-response pairs, we unlock the latent capabilities that pre-training established.

Key Takeaways

•Base models predict, not assist — Pre-training creates capability; instruction tuning creates controllability
•SFT is remarkably efficient — A few thousand examples suffice to dramatically change model behavior
•Data quality trumps quantity — Diverse, high-quality instructions are more valuable than volume
•Formatting matters — Consistent chat templates ensure reliable behavior at inference
•PEFT enables accessibility — LoRA/QLoRA allow fine-tuning large models on consumer hardware
•Evaluation is multifaceted — Combine benchmarks, LLM-as-judge, and human evaluation
•SFT is one piece — Full alignment requires additional techniques (RLHF) and ongoing monitoring

What's next:

While SFT teaches models to follow instructions, it optimizes for matching training outputs rather than genuine helpfulness. The next page covers RLHF (Reinforcement Learning from Human Feedback)—a technique that directly optimizes models for human preferences, addressing limitations of pure supervised learning and enabling more nuanced alignment.

Page Complete

You now understand instruction tuning—the process that creates useful AI assistants from raw language models. This knowledge enables you to create, evaluate, and deploy custom instruction-following models. Next, we explore RLHF for deeper alignment.