In Depth

The process has three stages: supervised fine-tuning (SFT) on human-written demonstrations, training a reward model on human preference rankings, and optimizing the LLM against that reward model with PPO or a similar RL algorithm. RLHF was central to creating InstructGPT and subsequently ChatGPT, Claude, and other assistant models. Its main limitation is that the resulting model's quality depends heavily on the consistency and representativeness of the human raters.
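To make stages two and three concrete, here is a minimal PyTorch sketch of the two objectives involved: the pairwise (Bradley-Terry) loss used to train the reward model, and the KL-penalized per-token reward that the PPO stage then maximizes to keep the policy close to the SFT reference. The function names and the `kl_coef` value are illustrative assumptions, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Stage 2: pairwise preference loss. Pushes the reward model to
    score the human-preferred response above the rejected one by
    maximizing log sigmoid of the reward margin."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def penalized_reward(rm_score: torch.Tensor,
                     logprob_policy: torch.Tensor,
                     logprob_ref: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Stage 3: the signal PPO optimizes. The reward-model score minus
    a per-token KL penalty that discourages the policy from drifting
    far from the SFT reference model."""
    return rm_score - kl_coef * (logprob_policy - logprob_ref)

# Toy usage: scalar reward-model scores for one preferred/rejected pair.
loss = reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3]))
```

The KL term is what distinguishes RLHF from naive reward maximization: without it, PPO tends to exploit flaws in the learned reward model, producing degenerate text that scores well but reads poorly.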