In Depth
RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are the two dominant methods for the final alignment stage of language model training. Both consume the same input: human preference data, collected as pairs of model responses to the same prompt in which annotators mark which response is better. They differ fundamentally in how they use this data.
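To make the input concrete, a single preference record might look like the following sketch. The field names are illustrative, not taken from any particular dataset:

```python
# Illustrative shape of one preference record: a prompt plus the
# response annotators preferred and the one they rejected.
# Field names here are hypothetical, not from a specific dataset.
preference_pair = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": "Plants are like little chefs that use sunlight to cook "
              "their own food from air and water.",
    "rejected": "Photosynthesis proceeds via the light-dependent "
                "reactions and the Calvin cycle in the chloroplast.",
}
```

A preference dataset is simply a list of such records; both RLHF and DPO start from this same structure.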
RLHF follows a multi-step process: first train a reward model on the preference data, then use reinforcement learning (typically PPO, Proximal Policy Optimization) to optimize the language model against that reward model. This pipeline is complex, requiring careful hyperparameter tuning and keeping several models (policy, frozen reference, reward model, and often a value model) in memory simultaneously. DPO, introduced in 2023, showed that the explicit reward model can be eliminated: under the same preference model, the language model's log-probability ratios against a frozen reference model act as an implicit reward, so alignment reduces to a simple classification-like loss computed directly on the preference pairs.
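The classification-like loss can be sketched in a few lines. This is a minimal scalar version for a single preference pair, assuming you have already summed the token log-probabilities of each response under the policy being trained and under the frozen reference model (a real implementation would batch this over tensors):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the trained policy or the frozen
    reference model. beta controls how far the policy may drift
    from the reference.
    """
    # Implicit reward of each response: beta * log-ratio vs. reference.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # Logistic (binary-classification) loss on the reward margin:
    # -log(sigmoid(margin)), written stably as log1p(exp(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy agrees with the reference (margin 0) the loss is log 2; as the policy assigns relatively more probability to the chosen response, the margin grows and the loss falls toward zero, which is exactly the gradient signal that replaces the reward model and PPO loop.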
DPO is simpler, more stable, and cheaper to implement than RLHF, which has driven rapid adoption. However, RLHF proponents argue it can achieve better results at frontier scale, because an explicit reward model can score arbitrary new samples, enabling online data collection rather than training on a fixed set of preference pairs. In practice, many organizations now use DPO or its variants (KTO, IPO, ORPO) for alignment, while the largest labs continue to invest in RLHF infrastructure. The debate is ongoing, and the optimal approach may depend on model scale and available resources.