In Depth

Direct Preference Optimization (DPO) trains the language model directly on pairs of preferred and rejected outputs, eliminating the separate reward-model training step of standard RLHF. Rather than fitting an explicit reward model and then optimizing against it with reinforcement learning, DPO treats the policy's log-probability ratio against a frozen reference model as an implicit reward and minimizes a simple classification-style loss over the preference pairs. This makes alignment cheaper and faster while maintaining quality comparable to full RLHF.
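
To make the objective concrete, here is a minimal PyTorch sketch of the DPO loss as published by Rafailov et al. (2023). The function name, argument names, and the choice of beta=0.1 are illustrative assumptions, not an API from any particular library; the inputs are per-sequence log probabilities (summed token log-probs) under the trained policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (preferred, rejected) response pairs.

    Each argument is a 1-D tensor of per-sequence log probabilities.
    beta controls how far the policy may drift from the reference.
    """
    # Implicit reward: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification: the preferred response should out-score
    # the rejected one; -log sigmoid of the margin is the loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random stand-in log-probs for a batch of 4 pairs.
logps = torch.randn(4)
loss = dpo_loss(logps + 0.5, logps - 0.5, logps, logps)
```

Because the reference model is frozen, its log probabilities are typically computed once per batch under `torch.no_grad()`, so gradients flow only through the policy terms.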