Supervised fine-tuning teaches a model what to say by showing it examples. But how do we teach a model how to say things well—to be helpful without being harmful, informative without being verbose, creative without being inappropriate?
This is the alignment problem in a narrow sense: making AI systems that behave according to human intentions and preferences, even in novel situations not covered by training examples.
Reinforcement Learning from Human Feedback (RLHF) addresses this by learning a model of human preferences and then optimizing the language model to satisfy those preferences. Rather than learning from static examples, the model learns from feedback: which responses do humans prefer?
RLHF is the technique behind the dramatic leap from GPT-3 to ChatGPT. The same base capabilities, but fundamentally different behavior—more helpful, less harmful, more aligned with what users actually want.
This page covers the complete RLHF pipeline: collecting human preferences, training reward models, policy optimization with PPO and DPO, and the challenges of preference learning. You will understand how to implement RLHF and reason about its strengths and limitations.
Supervised fine-tuning has fundamental limitations that RLHF addresses:
SFT teaches by demonstration—the model learns to imitate training examples. But imitation alone has limits.
Consider: "Explain quantum entanglement."
A physicist's explanation differs from a child's. Both might be "correct" for different audiences. SFT picks one; RLHF can learn the preference function.
SFT models also tend toward verbosity, and SFT offers no direct lever to correct it.
RLHF can directly optimize for conciseness when that's preferred.
SFT models generate plausible text, not necessarily truthful text: next-token prediction rewards fluent continuations whether or not they are accurate.
RLHF can reward honesty and penalize confident falsehoods.
| Capability | SFT Approach | RLHF Approach |
|---|---|---|
| Format following | Demonstrate formats in training | Same, RLHF adds refinement |
| Knowledge & capability | Pre-training establishes | Cannot add, only surface |
| Response quality | Quality of training data | Learned from comparisons |
| Nuanced preferences | Hard to encode in examples | Directly optimized |
| Safety alignment | Include refusal examples | Learn refusal boundaries |
| Style & tone | One style per dataset | Learn style preferences |
Humans express preferences more reliably than they generate ideal responses. It's easier to say 'Response A is better than Response B' than to write the perfect response yourself. RLHF exploits this asymmetry—cheap, reliable preference data rather than expensive, noisy demonstrations.
RLHF involves three stages, each with distinct objectives:
Start with a base model, apply instruction tuning as covered previously. This creates the initial policy $\pi_{SFT}$ that can follow instructions.
Purpose: Establish basic instruction-following behavior as a starting point for RL.
Collect human preference data and train a model to predict which responses humans prefer.
Purpose: Create a differentiable proxy for human judgment.
Optimize the language model using reinforcement learning with the reward model:
$$\max_{\pi} \mathbb{E}_{x \sim D,\, y \sim \pi(y|x)} \left[ r_\theta(x, y) - \beta \cdot D_{KL}(\pi \,\|\, \pi_{SFT}) \right]$$
Where $r_\theta$ is the learned reward model, $\beta$ is the KL penalty coefficient, and $\pi_{SFT}$ is the frozen reference policy from Stage 1.
Purpose: Produce a model that maximizes human preference.
RLHF requires maintaining multiple models: (1) the SFT model as reference, (2) the reward model for scoring, (3) the policy being optimized. For a 70B model, this means ~500GB+ GPU memory during training—a significant infrastructure requirement.
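As a rough sanity check of that figure, count the weight copies alone. This is a back-of-envelope sketch assuming bf16 weights (2 bytes per parameter) and four full-size 70B models; real setups shard, offload, or share backbones, and training adds optimizer states on top.

```python
# Back-of-envelope memory estimate: four full-size 70B-parameter models
# (policy, SFT reference, reward model, value model) held in bf16.
# Illustrative only; optimizer states and activations add substantially more.

def weights_gb(n_params, bytes_per_param=2):
    """GB needed for one copy of the weights."""
    return n_params * bytes_per_param / 1e9

n = 70e9
models = ["policy", "sft_reference", "reward_model", "value_model"]
total = sum(weights_gb(n) for _ in models)
print(f"{total:.0f} GB of weights alone")  # before optimizer states
```

Even before any gradients or activations, the weights alone exceed 500 GB, which is why RLHF at this scale requires multi-node training infrastructure.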
The reward model is the core innovation of RLHF—it learns to predict which responses humans prefer and translates this into a scalar score for RL optimization.
The reward model is typically the same architecture as the policy, with the language modeling head replaced by a scalar output:
Reward Model = Transformer Backbone + Linear(hidden_dim → 1)
The model processes (prompt, response) and outputs a single scalar reward.
Given preference pairs $(y_w, y_l)$ where $y_w$ is the preferred (winning) response and $y_l$ is the rejected (losing) response:
$$\mathcal{L}_{RM} = -\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$$
This is the Bradley-Terry model: the probability of preferring $y_w$ over $y_l$ is modeled as:
$$P(y_w \succ y_l | x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$
The model learns to assign higher reward to preferred responses.
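For concreteness, plugging hypothetical reward values into the Bradley-Terry formula (a quick sketch, not outputs of a real reward model):

```python
# Bradley-Terry with made-up reward values. Note the preference probability
# depends only on the reward *difference*, so learned rewards are
# identifiable only up to a constant shift.
import math

def preference_prob(r_w, r_l):
    """P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(preference_prob(1.2, 0.3))   # sigmoid(0.9) ~ 0.711
print(preference_prob(0.3, 1.2))   # symmetric: ~ 0.289
print(preference_prob(6.2, 5.3))   # same difference, same probability
```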
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """
    Reward model built on a transformer backbone.
    """
    def __init__(self, base_model, hidden_dim):
        super().__init__()
        self.backbone = base_model
        # Remove language modeling head, add reward head
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask):
        """
        Compute reward for (prompt, response) pair.

        Returns:
            rewards: [batch_size, 1] scalar rewards
        """
        # Get hidden states from backbone
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        # Use last token's hidden state (end of response)
        last_hidden = outputs.hidden_states[-1]

        # Find position of last non-padding token
        sequence_lengths = attention_mask.sum(dim=1) - 1
        last_token_hidden = last_hidden[
            torch.arange(last_hidden.size(0)),
            sequence_lengths
        ]

        # Compute scalar reward
        rewards = self.reward_head(last_token_hidden)
        return rewards


def compute_rm_loss(model, chosen_ids, rejected_ids, chosen_mask, rejected_mask):
    """
    Compute Bradley-Terry preference loss.
    """
    chosen_rewards = model(chosen_ids, chosen_mask)
    rejected_rewards = model(rejected_ids, rejected_mask)

    # Log-sigmoid of reward difference
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Accuracy for monitoring
    accuracy = (chosen_rewards > rejected_rewards).float().mean()

    return loss, accuracy
```

Preference data collection is expensive but critical:
Data Collection Pipeline:
Annotation Guidelines Example:
When comparing responses, prefer responses that are:
1. Helpful: Actually answers the question
2. Honest: Doesn't make claims without basis
3. Harmless: Avoids generating harmful content
4. Concise: Doesn't pad with unnecessary content
5. Well-formatted: Easy to read and understand
When in conflict: Harmless > Honest > Helpful
| Data Component | Typical Scale | Cost |
|---|---|---|
| Diverse prompts | 20K-100K | ~$5K (collection) |
| Generated responses | 2-4 per prompt | Compute cost |
| Human comparisons | 50K-500K pairs | $100K-$1M |
| Quality checking | 10% re-annotated | 10% additional |
Reward models have limited accuracy. If the policy overoptimizes, it finds adversarial examples that score highly on the reward model but poorly to real humans. This is 'reward hacking'—the reward model's failures become the policy's objectives.
Proximal Policy Optimization (PPO) is the most common algorithm for RLHF. It optimizes the policy while constraining how much it can change per update.
We want to maximize reward while staying close to the SFT model:
$$J(\pi) = \mathbb{E}_{x \sim D,\, y \sim \pi(y|x)} \left[ r_\theta(x, y) \right] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{SFT})$$
The KL penalty serves multiple purposes: it keeps the policy's outputs fluent by anchoring it to the SFT model, and it limits reward hacking by preventing the policy from drifting into regions where the reward model was never trained.
Step 1: Sample rollouts

```python
for prompt in batch:
    response = policy.generate(prompt)       # Sample from current policy
    reward = reward_model(prompt, response)  # Score with RM
    old_logprobs = policy.log_prob(response | prompt)  # For PPO ratio
```
Step 2: Compute advantages
# Per-token rewards (reward assigned to final token, or distributed)
token_rewards = compute_token_rewards(response, reward)
# Value function estimates
values = value_model(prompt + response)
# GAE (Generalized Advantage Estimation)
advantages = compute_gae(token_rewards, values, gamma=1.0, lam=0.95)
Step 3: PPO update
for _ in range(ppo_epochs):
new_logprobs = policy.log_prob(response | prompt)
ratio = exp(new_logprobs - old_logprobs)
# Clipped objective
pg_loss1 = -advantages * ratio
pg_loss2 = -advantages * clip(ratio, 1-ε, 1+ε)
pg_loss = max(pg_loss1, pg_loss2).mean()
# Value loss
value_loss = (values - returns).pow(2).mean()
# KL penalty
kl = compute_kl(policy, sft_policy, prompt, response)
# Total loss
loss = pg_loss + 0.5 * value_loss + β * kl
loss.backward()
optimizer.step()
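The helpers `compute_token_rewards` and `compute_gae` referenced in Step 2 are not defined above. One possible sketch in plain Python, following the common recipe of subtracting a per-token KL penalty and adding the reward model's score at the final token (the signatures here are adapted for a single trajectory and are illustrative, not a fixed API):

```python
# Sketch of the Step 2 helpers for one trajectory. Assumes per-token
# log-probs from the policy and the frozen SFT reference are available.

def compute_token_rewards(policy_logprobs, ref_logprobs, rm_reward, beta=0.1):
    """Per-token reward: -beta * (log pi - log pi_ref), RM score on last token."""
    rewards = [-beta * (lp - rp) for lp, rp in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_reward
    return rewards

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory."""
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

# Example: 3 response tokens, RM reward of 1.0 at the final token
rewards = compute_token_rewards([-1.0, -2.0, -1.5], [-1.1, -1.9, -1.4], 1.0)
advantages = compute_gae(rewards, values=[0.2, 0.1, 0.05])
```

Folding the KL penalty into the per-token rewards is one common design; the alternative, shown in the Step 3 pseudocode, keeps the KL term as a separate loss component.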
```python
from trl import PPOTrainer, PPOConfig

# Configuration
ppo_config = PPOConfig(
    model_name="sft-model",
    learning_rate=1.41e-5,
    batch_size=256,
    mini_batch_size=64,
    ppo_epochs=4,                # PPO updates per batch
    gradient_accumulation_steps=4,
    # PPO-specific
    cliprange=0.2,               # PPO clip parameter ε
    cliprange_value=0.2,         # Value function clip
    vf_coef=0.1,                 # Value loss coefficient
    # KL control
    init_kl_coef=0.2,            # Initial β
    target_kl=6.0,               # Adaptive KL target
    adap_kl_ctrl=True,           # Adapt β to maintain target
)

# Initialize trainer
trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=sft_model,         # Reference for KL
    tokenizer=tokenizer,
    dataset=prompt_dataset,
)

# Generation settings (passed per call, not part of PPOConfig)
generation_kwargs = {"max_new_tokens": 256, "temperature": 1.0}

# Training loop
for epoch in range(num_epochs):
    for batch in trainer.dataloader:
        # Generate responses
        query_tensors = batch["input_ids"]
        response_tensors = trainer.generate(query_tensors, **generation_kwargs)

        # Compute rewards with the separately trained reward model
        texts = [tokenizer.decode(r) for r in response_tensors]
        rewards = [reward_model.score(t) for t in texts]

        # PPO step
        stats = trainer.step(query_tensors, response_tensors, rewards)

        # Log metrics
        print(f"Mean reward: {stats['ppo/mean_rewards']:.3f}")
        print(f"KL divergence: {stats['ppo/kl']:.3f}")
```

| Hyperparameter | Typical Range | Too Low | Too High |
|---|---|---|---|
| KL coefficient (β) | 0.01-0.2 | Reward hacking | Stuck at SFT |
| PPO clip (ε) | 0.1-0.3 | Slow learning | Unstable updates |
| Learning rate | 1e-6 to 5e-5 | Slow convergence | KL explosion |
| PPO epochs | 1-4 | Underutilize data | Overfitting batch |
| Batch size | 64-512 | High variance | Less exploration |
Adaptive KL control is crucial:
```python
# Adjust β based on observed KL
if observed_kl > target_kl:
    β *= 1.5   # KL too high, increase penalty
else:
    β /= 1.5   # KL too low, decrease penalty
```
This maintains KL in a reasonable range without manual tuning.
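A smoother alternative to the fixed 1.5x adjustment is the proportional controller from Ziegler et al. (2019), roughly as implemented in TRL's `AdaptiveKLController`; the constants below are illustrative defaults:

```python
# Proportional KL controller: beta moves in small steps proportional to
# how far the observed KL is from target, with the error clipped for
# stability. A larger horizon means slower adaptation.

class AdaptiveKLController:
    def __init__(self, init_beta=0.2, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        error = observed_kl / self.target_kl - 1.0
        error = max(-0.2, min(0.2, error))   # clip to [-0.2, 0.2]
        self.beta *= 1.0 + error * n_steps / self.horizon

ctl = AdaptiveKLController()
ctl.update(observed_kl=12.0, n_steps=256)   # KL twice the target: beta rises
```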
RLHF training is notoriously sensitive. Common failure modes: KL divergence explosion, reward model exploitation, mode collapse, and perplexity degradation. Monitor closely: reward, KL, response diversity, and downstream task performance. Be prepared to restart from checkpoint.
PPO-based RLHF is effective but complex—requiring separate reward and value models, careful hyperparameter tuning, and online generation. Direct Preference Optimization (DPO) provides a simpler alternative that achieves similar results.
DPO observes that the optimal policy under the RLHF objective has a closed-form relationship with the reward:
$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$
Solving for the reward:
$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$
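For two responses to the same prompt $x$, the intractable $\beta \log Z(x)$ term is identical for both and cancels in the reward difference:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)}$$

Only policy and reference log-probabilities remain, all of which are directly computable.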
Substituting into the Bradley-Terry preference model, the partition function $Z(x)$ cancels, yielding the DPO loss:
$$\mathcal{L}_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$
This is a supervised learning objective on preference data—no RL needed!
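A quick numeric sanity check of this loss, using hypothetical sequence log-probabilities rather than outputs of a real model:

```python
# DPO loss on made-up sequence log-probabilities.
import math

def dpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    logits = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy puts more mass on the chosen response than the reference does:
low = dpo_loss(pol_w=-40.0, ref_w=-45.0, pol_l=-50.0, ref_l=-45.0)
# Policy favors the rejected response instead: larger loss.
high = dpo_loss(pol_w=-45.0, ref_w=-40.0, pol_l=-45.0, ref_l=-50.0)
assert low < high
```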
```python
import torch
import torch.nn.functional as F


def compute_dpo_loss(
    policy_model,
    reference_model,
    chosen_ids,       # Preferred response tokens
    rejected_ids,     # Rejected response tokens
    attention_mask,
    beta=0.1          # KL penalty strength
):
    """
    Compute DPO loss for a batch of preference pairs.
    """
    # Get log probabilities from policy model
    policy_chosen_logprobs = get_sequence_logprobs(
        policy_model, chosen_ids, attention_mask
    )
    policy_rejected_logprobs = get_sequence_logprobs(
        policy_model, rejected_ids, attention_mask
    )

    # Get log probabilities from reference model (frozen)
    with torch.no_grad():
        ref_chosen_logprobs = get_sequence_logprobs(
            reference_model, chosen_ids, attention_mask
        )
        ref_rejected_logprobs = get_sequence_logprobs(
            reference_model, rejected_ids, attention_mask
        )

    # Compute log ratios
    chosen_log_ratio = policy_chosen_logprobs - ref_chosen_logprobs
    rejected_log_ratio = policy_rejected_logprobs - ref_rejected_logprobs

    # DPO loss: log-sigmoid of scaled difference
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    loss = -F.logsigmoid(logits).mean()

    # Metrics
    chosen_reward = beta * chosen_log_ratio.detach()
    rejected_reward = beta * rejected_log_ratio.detach()
    reward_margin = (chosen_reward - rejected_reward).mean()
    accuracy = (chosen_reward > rejected_reward).float().mean()

    return loss, {
        "reward_margin": reward_margin,
        "accuracy": accuracy,
        "chosen_reward": chosen_reward.mean(),
        "rejected_reward": rejected_reward.mean(),
    }


def get_sequence_logprobs(model, input_ids, attention_mask):
    """Compute sum of log probabilities for a sequence.

    Note: this sums over all tokens for simplicity; in practice, prompt
    tokens are usually masked out so only response tokens contribute.
    """
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]   # Predict next token
    labels = input_ids[:, 1:]            # Shifted labels

    log_probs = F.log_softmax(logits, dim=-1)
    selected_log_probs = log_probs.gather(
        dim=-1, index=labels.unsqueeze(-1)
    ).squeeze(-1)

    # Sum over sequence (masking padding)
    mask = attention_mask[:, 1:]         # Match shifted length
    return (selected_log_probs * mask).sum(dim=1)
```

IPO (Identity Preference Optimization): Addresses overfitting in DPO by using a different loss formulation that doesn't drive log-ratios to infinity.
KTO (Kahneman-Tversky Optimization): Learns from binary feedback (good/bad) rather than preferences. Useful when comparison data is unavailable.
ORPO (Odds Ratio Preference Optimization): Combines SFT and preference learning in a single objective, avoiding the need for a separate SFT phase.
SimPO (Simple Preference Optimization): Removes the need for a reference model by using sequence length normalization.
| Method | Reference Model? | RL? | Data Type | Complexity |
|---|---|---|---|---|
| PPO | Yes | Yes | Online generation | High |
| DPO | Yes | No | Offline preferences | Medium |
| KTO | Optional | No | Binary feedback | Low |
| SimPO | No | No | Offline preferences | Low |
Use DPO when: you have high-quality preference data, want simpler training, or have limited compute. Use PPO when: you need online exploration, have access to reward model, or require iterative improvement with human feedback in the loop.
RLHF is powerful but imperfect. Understanding its limitations is crucial for responsible deployment.
The policy finds ways to score highly on the reward model without satisfying actual human preferences:
Examples: padding responses with unnecessary length, confident-sounding hedging in place of honest uncertainty, and sycophantic agreement with the user.
Mitigations: KL regularization against the reference policy, reward model ensembles, refreshing the reward model on samples from the current policy, and early stopping validated by human evaluation.
RLHF optimizes for the preferences of annotators, which may not represent all users:
The alignment target is fundamentally contested. Different users want different behaviors from AI. RLHF produces a model aligned to a particular conception of "good."
Human preferences are complex and context-dependent. Any fixed reward model inevitably oversimplifies:
| What We Want | What RM Might Reward |
|---|---|
| Honest uncertainty | Confident-sounding hedging |
| Helpful refusal | Refusing too often to be safe |
| Concise answers | Incomplete answers |
| Creative responses | Unpredictable responses |
RLHF primarily changes how the model presents information, not what it knows:
Concerning: a model may confidently refuse to help with harmful requests, yet still be jailbroken, because the underlying capability remains.
RLHF improves model behavior significantly but does not ensure models are safe, honest, or aligned in any deep sense. It's a technique for teaching preferences, not values. Systems aligned with RLHF still require careful monitoring, use restrictions, and ongoing safety work.
RLHF represents a fundamental shift from imitation to optimization—teaching models to satisfy human preferences rather than copy human outputs. This enables more nuanced alignment than SFT alone.
What's next:
With an understanding of how LLMs are trained and aligned, we turn to how they're used. The next page covers prompting and in-context learning—the art and science of getting the best out of language models through careful prompt design, few-shot examples, and chain-of-thought reasoning.
You now understand RLHF—the technique that transforms capable but uncontrolled language models into helpful assistants aligned with human preferences. This knowledge enables you to implement, evaluate, and reason about aligned AI systems. Next, we explore prompting and in-context learning.