Vol.01 · No.10 CS · AI · Infra May 30, 2026

AI Glossary

LLM & Generative AI · Deep Learning

RLHF

Reinforcement Learning from Human Feedback

Plain Explanation

Pretrained language models are good at continuing text, but their training objective (predict the next token) does not guarantee the answers people actually want. Even after supervised instruction tuning, models can still be verbose, unsafe, or unhelpful, because “what counts as better” is hard to write down as a rule. RLHF (Reinforcement Learning from Human Feedback) addresses this mismatch by turning human preferences into a training signal, so that the model’s behavior aligns with what people consistently choose.

You can picture RLHF as a coach reviewing two drafts for the same prompt and picking the better one. At scale, many such A/B choices are collected on tasks such as public TL;DR summarization or helpfulness/harmlessness comparisons, building a dataset of “chosen vs. rejected” replies. A separate reward model is then trained to score responses so that chosen answers rate higher than rejected ones, which provides a fast proxy for future human judgments.

The main model (treated as a policy over tokens) generates candidate replies, the reward model scores them, and those scores become learning signals that raise the probability of higher‑scoring responses. Policy‑gradient methods such as PPO are commonly used, together with explicit control of KL divergence (a measure of how far the new model’s token probabilities drift from a reference model, usually the SFT baseline) to keep updates stable. A KL penalty, or PPO’s clipped trust‑region‑style update, helps prevent the policy from chasing noisy rewards or forgetting its supervised skills.
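The two training signals described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the commonly used Bradley-Terry style pairwise loss for the reward model and a per-sequence KL penalty subtracted from the reward-model score. The function names and the `beta` coefficient are hypothetical, and rewards and log-probabilities are plain floats rather than model outputs.

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL-style penalty (beta is a hypothetical setting)."""
    # The summed per-token log-ratio is a single-sample estimate of how far
    # the policy has drifted from the reference model on this reply.
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

# The loss shrinks as the chosen reply is scored further above the rejected one.
loss_good = pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0)
loss_bad = pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0)

# A reply the policy favors more than the reference does pays a penalty.
shaped = kl_shaped_reward(rm_score=1.0, logp_policy=[-0.5], logp_ref=[-1.5])
```

The shaped reward is what a policy-gradient step would then try to increase, so a larger `beta` pulls the policy back toward its baseline at the cost of slower reward gains.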

Examples & Analogies

  • Public TL;DR summarization: For a given article, two summaries are compared and humans pick the clearer, more faithful one. Those pairwise choices train a reward model, and the policy is optimized so its summaries are more likely to earn higher reward on future articles.
  • Helpfulness/Harmlessness comparisons: Annotators choose responses that are useful and safe over ones that are evasive or risky. The learned reward then guides RL updates so the assistant more reliably prefers the helpful/harmless style in similar prompts.
  • Rules‑based reward model evaluation: When studying algorithms, a rules‑based reward can stand in for human labels (for example, scoring structural or formatting requirements). Researchers can then compare RLHF variants on how well the policy learns to satisfy those rules while remaining close to its baseline language quality.
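As a rough illustration of the rules-based case, the sketch below scores a reply by the fraction of formatting rules it satisfies. The three rules (a `TL;DR:` prefix, a word budget, and ending punctuation) and the function name are made up for this example and are not taken from any particular study.

```python
def rules_reward(response: str, max_words: int = 50) -> float:
    """Hypothetical rules-based reward: fraction of formatting rules satisfied."""
    text = response.strip()
    checks = [
        text.startswith("TL;DR:"),           # required prefix
        len(text.split()) <= max_words,      # stays within the word budget
        text.endswith((".", "!", "?")),      # ends with sentence punctuation
    ]
    return sum(checks) / len(checks)

compliant = rules_reward("TL;DR: The policy learns from pairwise preferences.")
sloppy = rules_reward("the model rambles on")
```

Because the score is deterministic, different RLHF variants can be compared on the same reward without label noise, though such a reward says nothing about overall language quality.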

At a Glance

| | RLHF | Instruction tuning (SFT) | Direct alignment (DPO-like) |
|---|---|---|---|
| Label signal | Human preference pairs (chosen vs. rejected) | Supervised examples of good replies | Preference pairs without learning a reward model |
| Uses reward model? | Yes, to score new generations | No | No (optimizes a direct preference objective) |
| Optimizer | RL policy gradients (often PPO) | Next‑token log‑likelihood | Direct gradient on preference objective |
| Stability control | KL penalty/clipping to limit drift from SFT | N/A beyond standard regularization | Careful objective design and hyperparameters |
| Typical aim | Align behavior and safety style post‑SFT | Teach formatting and task following | Preference alignment without an intermediate scorer |

RLHF adds an explicit reward‑model step followed by RL with KL control, SFT relies on supervised examples, and direct alignment takes gradients on preference pairs without an intermediate scorer. Even direct alignment still depends on careful hyperparameters and regularization.
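To make the direct-alignment column concrete, here is a minimal sketch of a DPO-style loss for a single preference pair. It assumes the standard DPO formulation, in which the log-probability ratio between the policy and a frozen reference model acts as an implicit reward. The sequence log-probabilities are plain floats and `beta` is a hypothetical setting.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one pair, given sequence log-probabilities."""
    # Implicit rewards: how much more the policy favors each reply than the reference does.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference has been learned yet.
loss_at_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has shifted probability toward the chosen reply and away from the rejected one.
loss_after_shift = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

No reward model appears anywhere in the function, which is the practical difference from RLHF; the reference log-probabilities play the stabilizing role that the KL term plays there.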

Where and Why It Matters

  • Post‑training standard: RLHF sits alongside instruction tuning and related stages in modern post‑training pipelines, providing a dedicated preference‑alignment step.
  • Algorithm choice matters: RLHF is not synonymous with PPO. With the same preference data, reward-model quality, KL control, and optimizer choice can materially change outcomes, so teams usually compare a small set of candidates on a held-out preference set.
  • Reward‑model caveats affect outcomes: Sparse feedback, misspecification, and misgeneralization in the reward model can degrade final policy quality, motivating conservative update sizes and careful data collection.
  • Stability practices: KL‑controlled or trust‑region‑style updates (e.g., PPO with penalty or clipping) are commonly used to keep the policy close to its supervised baseline while improving reward.
  • Where it’s most useful: RLHF is favored when “what we want” is easier to express as human preferences than as a hard rule or a simple accuracy label, such as tone, helpfulness, or safety trade‑offs.
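The stability practices in the list above can be sketched as follows, assuming the standard PPO clipped surrogate for a single sample plus a simple adaptive KL coefficient. The thresholds and the doubling factor in `adapt_kl_coef` are a hypothetical heuristic chosen for illustration, not a recommended setting.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

def adapt_kl_coef(beta, observed_kl, target_kl, factor=2.0):
    """Hypothetical heuristic: raise the penalty when KL overshoots, relax it when KL undershoots."""
    if observed_kl > 1.5 * target_kl:
        return beta * factor
    if observed_kl < target_kl / 1.5:
        return beta / factor
    return beta

# A large jump in probability on a good sample is credited only up to 1 + clip_eps.
capped = ppo_clipped_objective(logp_new=0.0, logp_old=-1.0, advantage=1.0)

# KL far above target: the penalty coefficient is doubled for the next batch.
new_beta = adapt_kl_coef(beta=0.1, observed_kl=0.5, target_kl=0.1)
```

Clipping limits how much a single batch can change the policy, while the KL coefficient controls cumulative drift from the supervised baseline over many updates.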

Common Misconceptions

  • ❌ Myth: RLHF teaches the model new facts and broad knowledge → ✅ Reality: It is a post‑training alignment step; it adjusts behavior and style rather than core world knowledge.
  • ❌ Myth: PPO is the only way to do RLHF → ✅ Reality: Multiple RLHF and preference‑learning algorithms exist, including direct alignment methods that skip a learned reward model.
  • ❌ Myth: A more accurate reward model always yields a better assistant → ✅ Reality: Studies note cases where better reward accuracy doesn’t translate to a better policy due to optimization dynamics and distribution shift.
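For reference, "reward accuracy" in this sense is usually measured as pairwise agreement on held-out preference data. The sketch below assumes that definition; `toy_score` is a deliberately crude placeholder scorer that prefers shorter replies, and the two held-out examples are invented for illustration.

```python
def preference_accuracy(score_fn, pairs):
    """Fraction of (prompt, chosen, rejected) triples where chosen outscores rejected."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1 for prompt, chosen, rejected in pairs
        if score_fn(prompt, chosen) > score_fn(prompt, rejected)
    )
    return correct / len(pairs)

def toy_score(prompt, response):
    # Placeholder scorer (an assumption for this sketch): shorter replies score higher.
    return -len(response.split())

held_out = [
    ("Summarize the article.",
     "Short faithful summary.",
     "A long rambling summary that repeats itself many times."),
    ("Explain KL divergence.",
     "It measures how far one distribution drifts from another.",
     "No."),
]

accuracy = preference_accuracy(toy_score, held_out)
```

This metric only checks the ranking of replies that annotators already saw. Once the policy is optimized against the scorer, it produces replies outside that distribution, which is one reason accuracy on held-out pairs and final policy quality can diverge.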

How It Sounds in Conversation

  • "Let’s keep the KL stable during RLHF; if it spikes, we’ll raise the penalty or tighten the clip before we blow up quality."
  • "The reward model likes shorter answers; can we rebalance the preference data so we don’t over‑compress TL;DR outputs?"
  • "Run the HH eval after tonight’s PPO sweep; I want to see if the stricter KL actually reduced unsafe completions."
  • "Spin a direct‑alignment baseline next to PPO so we can compare alignment gains without the reward‑model detour."
  • "We’re bottlenecked on preference pairs; let’s prioritize prompts where SFT fails and sample more challenging negatives."
