In Depth

Preference optimization encompasses methods that improve AI models by learning from human judgments about which outputs are better. Instead of specifying exactly what the correct output should be, these methods use relative comparisons: given two model responses to the same prompt, which one do humans prefer? This approach captures subjective quality aspects that are difficult to specify with explicit rules.
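These pairwise comparisons are commonly modeled with the Bradley-Terry model, which converts a difference in scalar quality scores into a preference probability. The sketch below is illustrative, not any library's API; the function name and the use of raw scalar rewards are our assumptions.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry model (illustrative sketch): probability that the
    chosen response is preferred over the rejected one, given scalar
    quality scores. A larger score gap means a more confident preference."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
```

When the two scores are equal the model predicts a 50/50 preference; as the gap grows, the probability saturates toward 1, which is why only relative score differences, not absolute values, matter in this framing.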

The most established method is RLHF (Reinforcement Learning from Human Feedback), which trains a separate reward model on human preferences and then optimizes the language model against it using reinforcement learning. DPO (Direct Preference Optimization) simplifies this by eliminating the separate reward model and directly optimizing the language model on preference pairs, treating the policy itself as an implicit reward function. Other variants include KTO (Kahneman-Tversky Optimization), IPO (Identity Preference Optimization), and ORPO (Odds Ratio Preference Optimization).
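The DPO objective can be written as a per-example loss over a preference pair. A minimal sketch, assuming per-sequence log-probabilities have already been computed under the trained policy and a frozen reference model (the function name and scalar interface are ours, not a library API):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss for one (chosen, rejected) preference pair.

    beta controls how far the policy may drift from the reference model:
    larger values penalize small margins more sharply.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response than the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: near zero once the policy
    # clearly prefers the chosen response; large when it does not.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is log 2, so gradient descent pushes the policy to widen the chosen-over-rejected gap beyond what the reference already exhibits.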

Preference optimization is the final training stage for most commercial AI assistants, following pre-training and instruction tuning. It is responsible for making models more helpful, reducing harmful outputs, improving response quality, and aligning behavior with human expectations. The quality of preference data, including annotator guidelines and inter-annotator agreement, directly impacts the final model's behavior.