RLHF
Reinforcement Learning from Human Feedback
Plain Explanation
Pretrained language models are great at continuing text, but their objective (predict the next token) doesn’t guarantee answers people actually want. Even after supervised instruction tuning, models can still be verbose, unsafe, or unhelpful because “what counts as better” is hard to write as a rule. RLHF (Reinforcement Learning from Human Feedback) addresses this mismatch by turning human preferences into a training signal so the model’s behavior aligns with what people consistently choose.

You can picture RLHF like a coach reviewing two drafts for the same prompt and picking the better one. At scale, many such A/B choices are collected on tasks like public TL;DR summarization or helpfulness/harmlessness comparisons, building a dataset of “chosen vs. rejected” replies. A separate reward model is then trained to score responses so that chosen answers rate higher than rejected ones, providing a fast proxy for future human judgments.

The main model (treated as a policy over tokens) generates candidate replies that are scored by the reward model; those scores become learning signals that increase the probability of higher‑scoring responses. Policy‑gradient methods such as PPO are commonly used, with explicit controls on KL divergence (a measure of how far the new model’s token probabilities drift from a reference model, typically the supervised baseline) to keep updates stable. A KL penalty or trust‑region‑style clipping limits how far each update can move the policy, which helps prevent it from chasing noisy rewards or forgetting its supervised skills.
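The reward-model and KL-control steps described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample log-ratio as the KL estimate, and the function names and the `beta` value are hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss shrinks as the reward model scores the chosen reply
    further above the rejected one.
    """
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(reward_score: float, logprob_policy: float,
                     logprob_reference: float, beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the reference model.

    (logprob_policy - logprob_reference) is a single-sample estimate of the
    KL term; beta (an assumed value here) sets how strongly drift is penalized.
    """
    return reward_score - beta * (logprob_policy - logprob_reference)
```

With equal scores the pairwise loss equals log 2, meaning the reward model has no preference yet; the loss falls as the chosen reply pulls ahead. In the shaped reward, a reply that the policy finds much more likely than the reference does has part of its reward taken back.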
Examples & Analogies
- Public TL;DR summarization: For a given article, two summaries are compared and humans pick the clearer, more faithful one. Those pairwise choices train a reward model, and the policy is optimized so its summaries are more likely to earn higher reward on future articles.
- Helpfulness/Harmlessness comparisons: Annotators choose responses that are useful and safe over ones that are evasive or risky. The learned reward then guides RL updates so the assistant more reliably prefers the helpful/harmless style in similar prompts.
- Rules‑based reward model evaluation: When studying algorithms, a rules‑based reward can stand in for human labels (for example, scoring structural or formatting requirements). Researchers can then compare RLHF variants on how well the policy learns to satisfy those rules while remaining close to its baseline language quality.
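As a rough illustration of the rules-based case, a reward function can simply check formatting requirements and return the fraction that pass. The three rules below (a word limit, terminal punctuation, and a `TL;DR:` prefix) are invented for this sketch and do not come from any particular study.

```python
def rules_based_reward(response: str, max_words: int = 50) -> float:
    """Score a response by the fraction of hypothetical formatting rules it satisfies."""
    text = response.strip()
    checks = [
        len(text.split()) <= max_words,         # stays within the word budget
        text.endswith((".", "!", "?")),         # ends with terminal punctuation
        text.lower().startswith("tl;dr:"),      # uses the required prefix
    ]
    return sum(checks) / len(checks)
```

Because the score is computed by code rather than by annotators, every RLHF variant under comparison is optimized against exactly the same, noise-free target.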
At a Glance
| | RLHF | Instruction tuning (SFT) | Direct alignment (DPO-like) |
|---|---|---|---|
| Label signal | Human preference pairs (chosen vs. rejected) | Supervised examples of good replies | Preference pairs without learning a reward model |
| Uses reward model? | Yes, to score new generations | No | No (optimizes a direct preference objective) |
| Optimizer | RL policy gradients (often PPO) | Next‑token log‑likelihood | Direct gradient on preference objective |
| Stability control | KL penalty/clipping to limit drift from SFT | N/A beyond standard regularization | Careful objective design and hyperparameters |
| Typical aim | Align behavior and safety style post‑SFT | Teach formatting and task following | Preference alignment without an intermediate scorer |
RLHF inserts an explicit reward‑model plus RL step with KL control, while SFT relies on supervised examples and direct alignment takes gradients on preference pairs; even direct alignment still depends on thoughtful hyperparameters and regularization.
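To make the "direct gradient on preference objective" column concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes sequence-level log-probabilities are already available from the policy and from a frozen reference model; the argument names and the `beta` default are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

No reward model appears anywhere: the log-ratio against the reference model plays the role of an implicit reward, and `beta` controls how far the policy is encouraged to move away from that reference.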
Where and Why It Matters
- Post‑training standard: RLHF sits alongside instruction tuning and related stages in modern post‑training pipelines, providing a dedicated preference‑alignment step.
- Algorithm choice matters: RLHF is not synonymous with PPO. Even with identical preference data, differences in reward-model quality, KL control, and optimizer choice can materially change outcomes, so teams usually compare a small set of candidates on a held-out preference set.
- Reward‑model caveats affect outcomes: Sparse feedback, misspecification, and misgeneralization in the reward model can degrade final policy quality, motivating conservative update sizes and careful data collection.
- Stability practices: KL‑controlled or trust‑region‑style updates (e.g., PPO with penalty or clipping) are commonly used to keep the policy close to its supervised baseline while improving reward.
- Where it’s most useful: RLHF is favored when “what we want” is easier to express as human preferences than as a hard rule or a simple accuracy label, such as tone, helpfulness, or safety trade‑offs.
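The two stability practices named above, clipping and a KL penalty, can each be sketched as a small function. This is a simplified per-sample view under stated assumptions: the clipped surrogate follows the standard PPO form, while the coefficient update is one simple proportional scheme with illustrative constants (`step_size`, the ±0.2 error clamp) rather than a prescribed recipe.

```python
import math


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new / pi_old for the sampled action
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def update_kl_coefficient(beta: float, observed_kl: float, target_kl: float,
                          step_size: float = 0.1) -> float:
    """Raise the KL penalty weight when KL overshoots the target, lower it otherwise."""
    # Clamp the relative error so one noisy batch cannot swing beta too far.
    error = max(-0.2, min(observed_kl / target_kl - 1.0, 0.2))
    return beta * (1.0 + step_size * error)
```

For a positive advantage, the objective stops growing once the probability ratio exceeds `1 + clip_eps`, so there is nothing to gain from a larger step. The coefficient update works on a slower timescale, nudging the penalty weight between batches so that measured KL tracks a target.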
Common Misconceptions
- ❌ Myth: RLHF teaches the model new facts and broad knowledge → ✅ Reality: It is a post‑training alignment step; it adjusts behavior and style rather than core world knowledge.
- ❌ Myth: PPO is the only way to do RLHF → ✅ Reality: Multiple RLHF and preference‑learning algorithms exist, including direct alignment methods that skip a learned reward model.
- ❌ Myth: A more accurate reward model always yields a better assistant → ✅ Reality: Studies note cases where better reward accuracy doesn’t translate to a better policy due to optimization dynamics and distribution shift.
How It Sounds in Conversation
- "Let’s keep the KL stable during RLHF; if it spikes, we’ll raise the penalty or tighten the clip before we blow up quality."
- "The reward model likes shorter answers; can we rebalance the preference data so we don’t over‑compress TL;DR outputs?"
- "Run the HH eval after tonight’s PPO sweep; I want to see if the stricter KL actually reduced unsafe completions."
- "Spin a direct‑alignment baseline next to PPO so we can compare alignment gains without the reward‑model detour."
- "We’re bottlenecked on preference pairs; let’s prioritize prompts where SFT fails and sample more challenging negatives."
References
- Training language models to follow instructions with human feedback
Canonical InstructGPT paper describing SFT, reward modeling, and PPO-style RLHF.
- Deep Reinforcement Learning from Human Preferences
Foundational human-preference reward-modeling paper.
- Learning to summarize from human feedback
RLHF application to summarization with human feedback and reward models.
- Aligning language models to follow instructions
OpenAI explanation of InstructGPT/RLHF, human preference data, and alignment goals.
- Reinforcement Learning from Human Feedback
Book-length treatment of RLHF data, reward models, PPO, and preference optimization.