RLHF for LLM Alignment
Modern large language models (LLMs) trained on vast internet text can generate coherent paragraphs, but their outputs don't inherently align with human preferences for being helpful, honest, and harmless. Reinforcement Learning from Human Feedback (RLHF) has emerged as the premier technique for bridging this gap. It transforms a powerful but untamed base model into an assistant that follows instructions, refuses harmful requests, and provides nuanced, high-quality responses, forming the core of today's most capable AI systems.
From Human Preferences to a Reward Model
The first major phase of RLHF is learning a numerical representation of human preferences, known as a reward model. You cannot have a reinforcement learning agent without a clear reward signal, and for subjective concepts like "helpfulness" or "harmlessness," this signal must be learned from people.
This process begins with annotation guidelines. Human labelers are presented with multiple model-generated responses to the same prompt and are asked to rank them from best to worst. Clear, consistent guidelines are essential to reduce noise and ensure labelers focus on the desired attributes (e.g., factuality over creativity, or safety over comprehensiveness). The resulting dataset consists of prompt-and-response pairs with comparison labels.
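As a concrete sketch of that data format (the field names and the pair-expansion helper are illustrative, not a standard schema), a single labeler's best-to-worst ranking can be expanded into the pairwise comparison records the reward model trains on:

```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[dict]:
    """Expand one labeler's best-to-worst ranking into (chosen, rejected) pairs.

    Every response earlier in the ranking is preferred over every later one,
    so n ranked responses yield n*(n-1)/2 comparison records.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]

# Three ranked responses produce three comparison records.
pairs = ranking_to_pairs(
    "Explain photosynthesis to a 10-year-old.",
    ["response A (best)", "response B", "response C (worst)"],
)
```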
Next, a separate reward model is trained on this comparison data. Typically, this model shares the same architecture as the base LLM (e.g., a Transformer) but replaces the language-modeling head with a final linear layer that outputs a single scalar reward score. The training objective is straightforward: for a given prompt x, the reward assigned to the preferred response should be higher than the reward assigned to the dispreferred response. A common choice is the pairwise ranking loss: if response y_w is preferred over response y_l, the model is trained to maximize the difference r(x, y_w) − r(x, y_l), passed through a logistic (sigmoid) function, giving the loss −log σ(r(x, y_w) − r(x, y_l)). This teaches the reward model to distill the annotators' subjective judgments into a function that can score any new response.
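A minimal sketch of that pairwise ranking loss, in pure Python for clarity (a real implementation would operate on batched tensors):

```python
import math

def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_w - r_l)).

    Written as log1p(exp(-margin)) for numerical stability; the loss
    shrinks toward zero as the preferred response's score pulls ahead
    of the dispreferred one's.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))
```

Minimizing this loss over the whole comparison dataset pushes the scalar head to score preferred responses above dispreferred ones by an increasing margin.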
Optimizing the Policy with PPO and KL Control
With a trained reward model r(x, y) providing the reward signal, the core reinforcement learning loop begins. The LLM to be aligned is treated as the policy π_θ(y | x), a probabilistic strategy for generating text. The goal is to update the parameters θ to maximize the expected reward from the reward model.
The most common algorithm used is Proximal Policy Optimization (PPO). PPO is favored for its stability in on-policy settings, where you learn from actions taken by the current policy. The process works as follows:
- The current policy generates responses to a batch of prompts.
- The reward model scores each response.
- PPO calculates a policy gradient update to increase the probability of generating high-reward tokens.
- A critical KL divergence penalty is applied. The KL divergence KL(π_θ ‖ π_ref) measures how much the new policy deviates from a reference policy π_ref, usually the original base model. Without this penalty, the policy would ruthlessly exploit the reward model, often by producing gibberish that happens to achieve a high score (reward hacking) or by catastrophically forgetting its general language capabilities. Subtracting the penalty term β · KL(π_θ ‖ π_ref) from the reward ensures the aligned model stays reasonably close to the original, coherent, and knowledgeable base model. The hyperparameter β controls the strength of this constraint.
The final objective PPO maximizes is E_{x∼D, y∼π_θ(·|x)} [ r(x, y) − β · KL(π_θ(· | x) ‖ π_ref(· | x)) ], where x is a prompt drawn from the prompt dataset D, y is the sampled response, and r(x, y) is the reward model's score.
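The shaped reward inside that objective can be sketched per sample as follows (the log-probability ratio is a single-sample estimate of the KL term, and β = 0.1 is an illustrative value, not a recommendation):

```python
def kl_penalized_reward(
    reward_model_score: float,
    logprob_policy: float,  # log pi_theta(y | x) for the sampled response
    logprob_ref: float,     # log pi_ref(y | x) under the frozen base model
    beta: float = 0.1,
) -> float:
    """Shaped reward: r(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    The log-ratio term penalizes responses to which the policy now assigns
    much more probability than the reference model does, keeping the policy
    anchored to the base model's distribution.
    """
    return reward_model_score - beta * (logprob_policy - logprob_ref)
```

In practice this penalty is often applied per token rather than per sequence, but the sequence-level form above conveys the idea.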
DPO: A Simpler Alternative to RLHF
The traditional RLHF pipeline with separate reward model training and PPO fine-tuning is complex and computationally expensive. Direct Preference Optimization (DPO) presents an elegant, simpler alternative that has gained significant traction.
DPO cleverly bypasses the need to train an explicit reward model. It starts from a key insight: given a reward function r and the reference model π_ref, the optimal policy π* that maximizes reward under a KL constraint has a specific closed-form expression. DPO inverts this relationship: the policy's own log-probability ratio against the reference defines an implicit reward. By directly training the policy on the same human preference comparison data with a modified loss function, DPO implicitly optimizes the same KL-constrained reward-maximizing objective as RLHF.
The DPO loss function encourages the policy to increase the relative probability of preferred responses versus dispreferred ones, while keeping the policy close to the reference model. The result is a single-stage training process that is often more stable and efficient than the PPO-based workflow, achieving similar levels of alignment. It exemplifies a trend in machine learning towards more direct, end-to-end optimization of desired objectives.
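A minimal sketch of the per-pair DPO loss under these assumptions (inputs are sequence-level log-probabilities; β = 0.1 is illustrative):

```python
import math

def dpo_loss(
    lp_chosen: float, lp_rejected: float,          # log-probs under the policy
    ref_lp_chosen: float, ref_lp_rejected: float,  # log-probs under pi_ref
    beta: float = 0.1,
) -> float:
    """DPO loss: -log sigmoid(beta * (policy_margin - reference_margin)).

    Each margin is log p(chosen) - log p(rejected) under that model; the
    loss rewards the policy for widening its preference margin relative
    to the frozen reference.
    """
    margin = (lp_chosen - lp_rejected) - (ref_lp_chosen - ref_lp_rejected)
    return math.log1p(math.exp(-beta * margin))
```

Note that the reference model's log-probabilities appear only as constants inside the loss, which is why no separate reward model or sampling loop is needed.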
Common Pitfalls
- Reward Hacking and Over-Optimization: This is the primary failure mode of RLHF. The policy may discover patterns that yield high reward model scores but are undesirable to humans (e.g., adding phrases like "I'm a helpful AI" regardless of context). The KL penalty mitigates this, but careful monitoring of generations throughout training is essential. A robust solution is to periodically collect new human preference data on the latest policy's outputs to retrain the reward model, creating an iterative "RLHF Flywheel."
- Poorly Defined Annotation Guidelines: If human labelers are given vague or contradictory instructions, the resulting preference data will be noisy, and the reward model will learn a confused objective. Guidelines must be specific, include clear examples of what constitutes a "good" and "bad" response for the task, and be regularly calibrated among annotators.
- Ignoring the Base Model's Capabilities: RLHF is a fine-tuning technique, not a magic capability injector. It can guide and constrain a model but cannot teach it knowledge or reasoning skills absent in the base model. Attempting to align a model that is not sufficiently capable on pre-training tasks will lead to poor results. The sequence is always: pre-train for knowledge and basic skill, then align for safety and preference.
- Forgetting to Validate on Held-Out Prompts: It's easy to overfit to the prompts used during the RLHF training loop. The aligned model must be rigorously evaluated on a separate set of prompts, often with direct human assessment, to ensure the alignment generalizes beyond the specific examples seen during optimization.
Summary
- RLHF is a multi-stage process for aligning LLMs with complex human preferences, using a learned reward model as a proxy objective for reinforcement learning.
- The policy is optimized using algorithms like PPO, which is stabilized by a critical KL divergence penalty that prevents the model from deviating too far from its original, useful knowledge base.
- Human annotation guidelines for creating preference data must be meticulously crafted, as they define the target objective for the entire system.
- Direct Preference Optimization (DPO) offers a simpler, single-stage alternative that directly trains the policy on preference data, eliminating the need for a separate reward model and complex RL tuning.
- Successful implementation requires vigilant guardrails against reward hacking and a clear understanding that alignment builds upon, rather than creates, a model's fundamental capabilities.