In Depth
DPO (Direct Preference Optimization) optimizes the language model directly on pairs of preferred and rejected outputs, eliminating the separate reward-model training step required by RLHF. This makes alignment cheaper and faster while achieving quality comparable to full RLHF.
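To make the objective concrete, here is a minimal sketch of the DPO loss for a single preference pair, following the published formulation: the loss is the negative log-sigmoid of a scaled difference between the policy-vs-reference log-probability ratios of the chosen and rejected responses. The function name, argument names, and the `beta` value are illustrative, not from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically plain logistic; a real implementation would use a
    # stable log-sigmoid (e.g. torch.nn.functional.logsigmoid).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# If the policy assigns relatively more mass to the chosen response,
# the margin is positive and the loss drops below log 2.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Minimizing this loss pushes the policy to raise the likelihood of preferred outputs relative to rejected ones, while the reference-model ratios keep it from drifting too far from the starting checkpoint.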