Supervised fine-tuning teaches a model what to say by showing it examples. But how do we teach a model how to say things well—to be helpful without being harmful, informative without being verbose, creative without being inappropriate?
This is the alignment problem in a narrow sense: making AI systems that behave according to human intentions and preferences, even in novel situations not covered by training examples.
Reinforcement Learning from Human Feedback (RLHF) addresses this by learning a model of human preferences and then optimizing the language model to satisfy those preferences. Rather than learning from static examples, the model learns from feedback: which responses do humans prefer?
RLHF is the technique behind the dramatic leap from GPT-3 to ChatGPT. The same base capabilities, but fundamentally different behavior—more helpful, less harmful, more aligned with what users actually want.
This page covers the complete RLHF pipeline: collecting human preferences, training reward models, policy optimization with PPO and DPO, and the challenges of preference learning. You will understand how to implement RLHF and reason about its strengths and limitations.
Supervised fine-tuning has fundamental limitations that RLHF addresses:
SFT teaches by demonstration—the model learns to imitate training examples. But imitation alone has limits.
Consider: "Explain quantum entanglement."
A physicist's explanation differs from a child's. Both might be "correct" for different audiences. SFT picks one; RLHF can learn the preference function.
SFT models also tend toward verbosity, and SFT offers no direct lever to correct it.
RLHF can directly optimize for conciseness when that's preferred.
SFT models generate plausible text, not necessarily truthful text: next-token prediction rewards fluent continuations whether or not they are accurate.
RLHF can reward honesty and penalize confident falsehoods.
| Capability | SFT Approach | RLHF Approach |
|---|---|---|
| Format following | Demonstrate formats in training | Same, RLHF adds refinement |
| Knowledge & capability | Pre-training establishes | Cannot add, only surface |
| Response quality | Quality of training data | Learned from comparisons |
| Nuanced preferences | Hard to encode in examples | Directly optimized |
| Safety alignment | Include refusal examples | Learn refusal boundaries |
| Style & tone | One style per dataset | Learn style preferences |
Humans express preferences more reliably than they generate ideal responses. It's easier to say 'Response A is better than Response B' than to write the perfect response yourself. RLHF exploits this asymmetry—cheap, reliable preference data rather than expensive, noisy demonstrations.
RLHF involves three stages, each with distinct objectives:
Start with a base model, apply instruction tuning as covered previously. This creates the initial policy $\pi_{SFT}$ that can follow instructions.
Purpose: Establish basic instruction-following behavior as a starting point for RL.
Collect human preference data and train a model to predict which responses humans prefer.
Purpose: Create a differentiable proxy for human judgment.
Optimize the language model using reinforcement learning with the reward model:
$$\max_{\pi} \mathbb{E}_{x \sim D,\, y \sim \pi(y|x)} \left[ r_\theta(x, y) - \beta \cdot D_{KL}(\pi \,\|\, \pi_{SFT}) \right]$$
Where $r_\theta$ is the learned reward model, $\beta$ is the KL penalty coefficient, and $\pi_{SFT}$ is the frozen reference policy from Stage 1.
Purpose: Produce a model that maximizes human preference.
RLHF requires maintaining multiple models: (1) the SFT model as reference, (2) the reward model for scoring, (3) the policy being optimized. For a 70B model, this means ~500GB+ GPU memory during training—a significant infrastructure requirement.
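As a rough sanity check of that figure, count the weight copies alone. This is a back-of-envelope sketch assuming bf16 weights (2 bytes per parameter) and four full-size 70B models; real setups shard, offload, or share backbones, and training adds optimizer states on top.

```python
# Back-of-envelope memory estimate: four full-size 70B-parameter models
# (policy, SFT reference, reward model, value model) held in bf16.
# Illustrative only; optimizer states and activations add substantially more.

def weights_gb(n_params, bytes_per_param=2):
    """GB needed for one copy of the weights."""
    return n_params * bytes_per_param / 1e9

n = 70e9
models = ["policy", "sft_reference", "reward_model", "value_model"]
total = sum(weights_gb(n) for _ in models)
print(f"{total:.0f} GB of weights alone")  # before optimizer states
```

Even before any gradients or activations, the weights alone exceed 500 GB, which is why RLHF at this scale requires multi-node training infrastructure.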
The reward model is the core innovation of RLHF—it learns to predict which responses humans prefer and translates this into a scalar score for RL optimization.
The reward model is typically the same architecture as the policy, with the language modeling head replaced by a scalar output:
Reward Model = Transformer Backbone + Linear(hidden_dim → 1)
The model processes (prompt, response) and outputs a single scalar reward.
Given preference pairs $(y_w, y_l)$ where $y_w$ is the preferred (winning) response and $y_l$ is the rejected (losing) response:
$$\mathcal{L}_{RM} = -\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$$
This is the Bradley-Terry model: the probability of preferring $y_w$ over $y_l$ is modeled as:
$$P(y_w \succ y_l | x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$
The model learns to assign higher reward to preferred responses.
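For concreteness, plugging hypothetical reward values into the Bradley-Terry formula (a quick sketch, not outputs of a real reward model):

```python
# Bradley-Terry with made-up reward values. Note the preference probability
# depends only on the reward *difference*, so learned rewards are
# identifiable only up to a constant shift.
import math

def preference_prob(r_w, r_l):
    """P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(preference_prob(1.2, 0.3))   # sigmoid(0.9) ~ 0.711
print(preference_prob(0.3, 1.2))   # symmetric: ~ 0.289
print(preference_prob(6.2, 5.3))   # same difference, same probability
```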
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """
    Reward model built on a transformer backbone.
    """
    def __init__(self, base_model, hidden_dim):
        super().__init__()
        self.backbone = base_model
        # Remove language modeling head, add reward head
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask):
        """
        Compute reward for (prompt, response) pair.

        Returns:
            rewards: [batch_size, 1] scalar rewards
        """
        # Get hidden states from backbone
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        # Use last token's hidden state (end of response)
        last_hidden = outputs.hidden_states[-1]

        # Find position of last non-padding token
        sequence_lengths = attention_mask.sum(dim=1) - 1
        last_token_hidden = last_hidden[
            torch.arange(last_hidden.size(0)),
            sequence_lengths
        ]

        # Compute scalar reward
        rewards = self.reward_head(last_token_hidden)
        return rewards


def compute_rm_loss(model, chosen_ids, rejected_ids, chosen_mask, rejected_mask):
    """
    Compute Bradley-Terry preference loss.
    """
    chosen_rewards = model(chosen_ids, chosen_mask)
    rejected_rewards = model(rejected_ids, rejected_mask)

    # Log-sigmoid of reward difference
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Accuracy for monitoring
    accuracy = (chosen_rewards > rejected_rewards).float().mean()

    return loss, accuracy
```

Preference data collection is expensive but critical:
Data Collection Pipeline:
Annotation Guidelines Example:
When comparing responses, prefer responses that are:
1. Helpful: Actually answers the question
2. Honest: Doesn't make claims without basis
3. Harmless: Avoids generating harmful content
4. Concise: Doesn't pad with unnecessary content
5. Well-formatted: Easy to read and understand
When in conflict: Harmless > Honest > Helpful
| Data Component | Typical Scale | Cost |
|---|---|---|
| Diverse prompts | 20K-100K | ~$5K (collection) |
| Generated responses | 2-4 per prompt | Compute cost |
| Human comparisons | 50K-500K pairs | $100K-$1M |
| Quality checking | 10% re-annotated | 10% additional |
Reward models have limited accuracy. If the policy overoptimizes, it finds adversarial examples that score highly on the reward model but poorly to real humans. This is 'reward hacking'—the reward model's failures become the policy's objectives.
Proximal Policy Optimization (PPO) is the most common algorithm for RLHF. It optimizes the policy while constraining how much it can change per update.
We want to maximize reward while staying close to the SFT model:
$$J(\pi) = \mathbb{E}_{x \sim D,\, y \sim \pi(y|x)} \left[ r_\theta(x, y) \right] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{SFT})$$
The KL penalty serves multiple purposes: it keeps the policy's outputs fluent by anchoring it to the SFT model, and it limits reward hacking by preventing the policy from drifting into regions where the reward model was never trained.
Step 1: Sample rollouts

```python
for prompt in batch:
    response = policy.generate(prompt)       # Sample from current policy
    reward = reward_model(prompt, response)  # Score with RM
    old_logprobs = policy.log_prob(response | prompt)  # For PPO ratio
```
Step 2: Compute advantages
# Per-token rewards (reward assigned to final token, or distributed)
token_rewards = compute_token_rewards(response, reward)
# Value function estimates
values = value_model(prompt + response)
# GAE (Generalized Advantage Estimation)
advantages = compute_gae(token_rewards, values, gamma=1.0, lam=0.95)
Step 3: PPO update
for _ in range(ppo_epochs):
new_logprobs = policy.log_prob(response | prompt)
ratio = exp(new_logprobs - old_logprobs)
# Clipped objective
pg_loss1 = -advantages * ratio
pg_loss2 = -advantages * clip(ratio, 1-ε, 1+ε)
pg_loss = max(pg_loss1, pg_loss2).mean()
# Value loss
value_loss = (values - returns).pow(2).mean()
# KL penalty
kl = compute_kl(policy, sft_policy, prompt, response)
# Total loss
loss = pg_loss + 0.5 * value_loss + β * kl
loss.backward()
optimizer.step()
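The helpers `compute_token_rewards` and `compute_gae` referenced in Step 2 are not defined above. One possible sketch in plain Python, following the common recipe of subtracting a per-token KL penalty and adding the reward model's score at the final token (the signatures here are adapted for a single trajectory and are illustrative, not a fixed API):

```python
# Sketch of the Step 2 helpers for one trajectory. Assumes per-token
# log-probs from the policy and the frozen SFT reference are available.

def compute_token_rewards(policy_logprobs, ref_logprobs, rm_reward, beta=0.1):
    """Per-token reward: -beta * (log pi - log pi_ref), RM score on last token."""
    rewards = [-beta * (lp - rp) for lp, rp in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_reward
    return rewards

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory."""
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

# Example: 3 response tokens, RM reward of 1.0 at the final token
rewards = compute_token_rewards([-1.0, -2.0, -1.5], [-1.1, -1.9, -1.4], 1.0)
advantages = compute_gae(rewards, values=[0.2, 0.1, 0.05])
```

Folding the KL penalty into the per-token rewards is one common design; the alternative, shown in the Step 3 pseudocode, keeps the KL term as a separate loss component.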
```python
from trl import PPOTrainer, PPOConfig

# Configuration
ppo_config = PPOConfig(
    model_name="sft-model",
    learning_rate=1.41e-5,
    batch_size=256,
    mini_batch_size=64,
    ppo_epochs=4,                # PPO updates per batch
    gradient_accumulation_steps=4,
    # PPO-specific
    cliprange=0.2,               # PPO clip parameter ε
    cliprange_value=0.2,         # Value function clip
    vf_coef=0.1,                 # Value loss coefficient
    # KL control
    init_kl_coef=0.2,            # Initial β
    target_kl=6.0,               # Adaptive KL target
    adap_kl_ctrl=True,           # Adapt β to maintain target
)

# Initialize trainer
trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=sft_model,         # Reference for KL
    tokenizer=tokenizer,
    dataset=prompt_dataset,
)

# Generation settings (passed per call, not part of PPOConfig)
generation_kwargs = {"max_new_tokens": 256, "temperature": 1.0}

# Training loop
for epoch in range(num_epochs):
    for batch in trainer.dataloader:
        # Generate responses
        query_tensors = batch["input_ids"]
        response_tensors = trainer.generate(query_tensors, **generation_kwargs)

        # Compute rewards with the separately trained reward model
        texts = [tokenizer.decode(r) for r in response_tensors]
        rewards = [reward_model.score(t) for t in texts]

        # PPO step
        stats = trainer.step(query_tensors, response_tensors, rewards)

        # Log metrics
        print(f"Mean reward: {stats['ppo/mean_rewards']:.3f}")
        print(f"KL divergence: {stats['ppo/kl']:.3f}")
```

| Hyperparameter | Typical Range | Too Low | Too High |
|---|---|---|---|
| KL coefficient (β) | 0.01-0.2 | Reward hacking | Stuck at SFT |
| PPO clip (ε) | 0.1-0.3 | Slow learning | Unstable updates |
| Learning rate | 1e-6 to 5e-5 | Slow convergence | KL explosion |
| PPO epochs | 1-4 | Underutilize data | Overfitting batch |
| Batch size | 64-512 | High variance | Less exploration |
Adaptive KL control is crucial:
```python
# Adjust β based on observed KL
if observed_kl > target_kl:
    β *= 1.5   # KL too high, increase penalty
else:
    β /= 1.5   # KL too low, decrease penalty
```
This maintains KL in a reasonable range without manual tuning.
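A smoother alternative to the fixed 1.5x adjustment is the proportional controller from Ziegler et al. (2019), roughly as implemented in TRL's `AdaptiveKLController`; the constants below are illustrative defaults:

```python
# Proportional KL controller: beta moves in small steps proportional to
# how far the observed KL is from target, with the error clipped for
# stability. A larger horizon means slower adaptation.

class AdaptiveKLController:
    def __init__(self, init_beta=0.2, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        error = observed_kl / self.target_kl - 1.0
        error = max(-0.2, min(0.2, error))   # clip to [-0.2, 0.2]
        self.beta *= 1.0 + error * n_steps / self.horizon

ctl = AdaptiveKLController()
ctl.update(observed_kl=12.0, n_steps=256)   # KL twice the target: beta rises
```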
RLHF training is notoriously sensitive. Common failure modes: KL divergence explosion, reward model exploitation, mode collapse, and perplexity degradation. Monitor closely: reward, KL, response diversity, and downstream task performance. Be prepared to restart from checkpoint.
PPO-based RLHF is effective but complex—requiring separate reward and value models, careful hyperparameter tuning, and online generation. Direct Preference Optimization (DPO) provides a simpler alternative that achieves similar results.
DPO observes that the optimal policy under the RLHF objective has a closed-form relationship with the reward:
$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$
Solving for the reward:
$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$
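For two responses to the same prompt $x$, the intractable $\beta \log Z(x)$ term is identical for both and cancels in the reward difference:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)}$$

Only policy and reference log-probabilities remain, all of which are directly computable.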
Substituting into the Bradley-Terry preference model, the partition function $Z(x)$ cancels, yielding the DPO loss:
$$\mathcal{L}_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$
This is a supervised learning objective on preference data—no RL needed!
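A quick numeric sanity check of this loss, using hypothetical sequence log-probabilities rather than outputs of a real model:

```python
# DPO loss on made-up sequence log-probabilities.
import math

def dpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    logits = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy puts more mass on the chosen response than the reference does:
low = dpo_loss(pol_w=-40.0, ref_w=-45.0, pol_l=-50.0, ref_l=-45.0)
# Policy favors the rejected response instead: larger loss.
high = dpo_loss(pol_w=-45.0, ref_w=-40.0, pol_l=-45.0, ref_l=-50.0)
assert low < high
```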
```python
import torch
import torch.nn.functional as F


def compute_dpo_loss(
    policy_model,
    reference_model,
    chosen_ids,       # Preferred response tokens
    rejected_ids,     # Rejected response tokens
    attention_mask,
    beta=0.1          # KL penalty strength
):
    """
    Compute DPO loss for a batch of preference pairs.
    """
    # Get log probabilities from policy model
    policy_chosen_logprobs = get_sequence_logprobs(
        policy_model, chosen_ids, attention_mask
    )
    policy_rejected_logprobs = get_sequence_logprobs(
        policy_model, rejected_ids, attention_mask
    )

    # Get log probabilities from reference model (frozen)
    with torch.no_grad():
        ref_chosen_logprobs = get_sequence_logprobs(
            reference_model, chosen_ids, attention_mask
        )
        ref_rejected_logprobs = get_sequence_logprobs(
            reference_model, rejected_ids, attention_mask
        )

    # Compute log ratios
    chosen_log_ratio = policy_chosen_logprobs - ref_chosen_logprobs
    rejected_log_ratio = policy_rejected_logprobs - ref_rejected_logprobs

    # DPO loss: log-sigmoid of scaled difference
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    loss = -F.logsigmoid(logits).mean()

    # Metrics
    chosen_reward = beta * chosen_log_ratio.detach()
    rejected_reward = beta * rejected_log_ratio.detach()
    reward_margin = (chosen_reward - rejected_reward).mean()
    accuracy = (chosen_reward > rejected_reward).float().mean()

    return loss, {
        "reward_margin": reward_margin,
        "accuracy": accuracy,
        "chosen_reward": chosen_reward.mean(),
        "rejected_reward": rejected_reward.mean(),
    }


def get_sequence_logprobs(model, input_ids, attention_mask):
    """Compute sum of log probabilities for a sequence.

    Note: this sums over all tokens for simplicity; in practice, prompt
    tokens are usually masked out so only response tokens contribute.
    """
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]   # Predict next token
    labels = input_ids[:, 1:]            # Shifted labels

    log_probs = F.log_softmax(logits, dim=-1)
    selected_log_probs = log_probs.gather(
        dim=-1, index=labels.unsqueeze(-1)
    ).squeeze(-1)

    # Sum over sequence (masking padding)
    mask = attention_mask[:, 1:]         # Match shifted length
    return (selected_log_probs * mask).sum(dim=1)
```

IPO (Identity Preference Optimization): Addresses overfitting in DPO by using a different loss formulation that doesn't drive log-ratios to infinity.
KTO (Kahneman-Tversky Optimization): Learns from binary feedback (good/bad) rather than preferences. Useful when comparison data is unavailable.
ORPO (Odds Ratio Preference Optimization): Combines SFT and preference learning in a single objective, avoiding the need for a separate SFT phase.
SimPO (Simple Preference Optimization): Removes the need for a reference model by using sequence length normalization.
| Method | Reference Model? | RL? | Data Type | Complexity |
|---|---|---|---|---|
| PPO | Yes | Yes | Online generation | High |
| DPO | Yes | No | Offline preferences | Medium |
| KTO | Optional | No | Binary feedback | Low |
| SimPO | No | No | Offline preferences | Low |
Use DPO when: you have high-quality preference data, want simpler training, or have limited compute. Use PPO when: you need online exploration, have access to reward model, or require iterative improvement with human feedback in the loop.
RLHF is powerful but imperfect. Understanding its limitations is crucial for responsible deployment.
The policy finds ways to score highly on the reward model without satisfying actual human preferences:
Examples: padding responses with unnecessary length, confident-sounding hedging in place of honest uncertainty, and sycophantic agreement with the user.
Mitigations: KL regularization against the reference policy, reward model ensembles, refreshing the reward model on samples from the current policy, and early stopping validated by human evaluation.
RLHF optimizes for the preferences of annotators, which may not represent all users:
The alignment target is fundamentally contested. Different users want different behaviors from AI. RLHF produces a model aligned to a particular conception of "good."
Human preferences are complex and context-dependent. Any fixed reward model inevitably oversimplifies:
| What We Want | What RM Might Reward |
|---|---|
| Honest uncertainty | Confident-sounding hedging |
| Helpful refusal | Refusing too often to be safe |
| Concise answers | Incomplete answers |
| Creative responses | Unpredictable responses |
RLHF primarily changes how the model presents information, not what it knows:
Concerning: a model may confidently refuse to help with harmful requests, yet still be jailbroken, because the underlying capability remains.
RLHF improves model behavior significantly but does not ensure models are safe, honest, or aligned in any deep sense. It's a technique for teaching preferences, not values. Systems aligned with RLHF still require careful monitoring, use restrictions, and ongoing safety work.
RLHF represents a fundamental shift from imitation to optimization—teaching models to satisfy human preferences rather than copy human outputs. This enables more nuanced alignment than SFT alone.
What's next:
With an understanding of how LLMs are trained and aligned, we turn to how they're used. The next page covers prompting and in-context learning—the art and science of getting the best out of language models through careful prompt design, few-shot examples, and chain-of-thought reasoning.
You now understand RLHF—the technique that transforms capable but uncontrolled language models into helpful assistants aligned with human preferences. This knowledge enables you to implement, evaluate, and reason about aligned AI systems. Next, we explore prompting and in-context learning.