
Reinforcement Learning from Human Feedback (RLHF)

Definition

A technique that trains a reward model directly from human feedback, then uses that model as the reward signal to align an AI agent's behavior with human preferences.

Deep Dive

Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning the behavior of AI models, particularly large language models (LLMs), with complex human values and preferences. It addresses the challenge of making AI systems perform tasks not only correctly but also in a way that is helpful, harmless, and honest, qualities that are difficult to specify through hand-crafted reward functions alone. The process leverages human evaluators who give feedback on the model's generated outputs, often by ranking multiple candidate responses by quality, relevance, or safety; these rankings are used to train a reward model, which in turn guides reinforcement-learning fine-tuning of the base model.
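
To make the reward-model step concrete, here is a minimal PyTorch sketch. It assumes human labelers have marked one response in each pair as preferred, and trains a scorer with the pairwise Bradley-Terry loss commonly used for this step. The `RewardModel` class, its layer sizes, and the random embeddings are illustrative placeholders, not any particular system's implementation; a production reward model would score (prompt, response) pairs with a language-model backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Hypothetical scorer: maps a fixed-size response embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one reward per response

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy batch: stand-in embeddings for responses humans preferred vs. rejected.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```

Once trained, a reward model like this scores candidate outputs so a reinforcement-learning algorithm (commonly PPO in published RLHF pipelines) can fine-tune the base model toward higher-scoring, human-preferred responses.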

Examples & Use Cases

  • Fine-tuning a large language model (LLM) like ChatGPT to generate responses that are helpful, harmless, and aligned with ethical guidelines
  • Training a conversational AI to better understand and respond to user intent in a more natural and polite manner
  • Improving the relevance and quality of search engine results by having users rate the usefulness of presented information

Related Terms

Reinforcement Learning • Large Language Model (LLM) • Human-in-the-Loop
