In Depth

The process has three stages: supervised fine-tuning (SFT) on human-written demonstrations, training a reward model on human preference rankings, and optimizing the LLM against that reward model with PPO or a similar RL algorithm. RLHF was central to creating InstructGPT and subsequently ChatGPT, Claude, and other assistant models. Its main limitation is that the resulting model's quality depends heavily on the consistency and representativeness of the human raters.
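To make stages two and three concrete, here is a minimal PyTorch sketch of the two objectives involved: the pairwise (Bradley-Terry) loss used to train the reward model, and the KL-penalized per-token reward that the PPO stage then maximizes to keep the policy close to the SFT reference. The function names and the `kl_coef` value are illustrative assumptions, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Stage 2: pairwise preference loss. Pushes the reward model to
    score the human-preferred response above the rejected one by
    maximizing log sigmoid of the reward margin."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def penalized_reward(rm_score: torch.Tensor,
                     logprob_policy: torch.Tensor,
                     logprob_ref: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Stage 3: the signal PPO optimizes. The reward-model score minus
    a per-token KL penalty that discourages the policy from drifting
    far from the SFT reference model."""
    return rm_score - kl_coef * (logprob_policy - logprob_ref)

# Toy usage: scalar reward-model scores for one preferred/rejected pair.
loss = reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3]))
```

The KL term is what distinguishes RLHF from naive reward maximization: without it, PPO tends to exploit flaws in the learned reward model, producing degenerate text that scores well but reads poorly.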