In Depth

Preference optimization encompasses methods that improve AI models by learning from human judgments about which outputs are better. Instead of specifying exactly what the correct output should be, these methods use relative comparisons: given two model responses to the same prompt, which one do humans prefer? This approach captures subjective quality aspects that are difficult to specify with explicit rules.
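These pairwise comparisons are commonly modeled with the Bradley-Terry model, which converts a difference in scalar quality scores into a preference probability. The sketch below is illustrative, not any library's API; the function name and the use of raw scalar rewards are our assumptions.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry model (illustrative sketch): probability that the
    chosen response is preferred over the rejected one, given scalar
    quality scores. A larger score gap means a more confident preference."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
```

When the two scores are equal the model predicts a 50/50 preference; as the gap grows, the probability saturates toward 1, which is why only relative score differences, not absolute values, matter in this framing.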

The most established method is RLHF (Reinforcement Learning from Human Feedback), which trains a separate reward model on human preferences and then optimizes the language model against it using reinforcement learning. DPO (Direct Preference Optimization) simplifies this by eliminating the separate reward model and directly optimizing the language model on preference pairs, treating the policy itself as an implicit reward function. Other variants include KTO (Kahneman-Tversky Optimization), IPO (Identity Preference Optimization), and ORPO (Odds Ratio Preference Optimization).
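The DPO objective can be written as a per-example loss over a preference pair. A minimal sketch, assuming per-sequence log-probabilities have already been computed under the trained policy and a frozen reference model (the function name and scalar interface are ours, not a library API):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss for one (chosen, rejected) preference pair.

    beta controls how far the policy may drift from the reference model:
    larger values penalize small margins more sharply.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response than the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: near zero once the policy
    # clearly prefers the chosen response; large when it does not.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is log 2, so gradient descent pushes the policy to widen the chosen-over-rejected gap beyond what the reference already exhibits.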

Preference optimization is the final training stage for most commercial AI assistants, following pre-training and instruction tuning. It is responsible for making models more helpful, reducing harmful outputs, improving response quality, and aligning behavior with human expectations. The quality of preference data, including annotator guidelines and inter-annotator agreement, directly impacts the final model's behavior.