
RLHF

RLHF (reinforcement learning from human feedback) is a training method that uses human ratings of model outputs to steer an AI model toward helpful and safe answers. It helped make models like ChatGPT practical for everyday use.

In RLHF, humans rate or rank different model responses to the same prompt, and a reward model is trained on these preferences to predict which responses people prefer. The language model is then optimized with reinforcement learning to produce responses that the reward model scores highly. This makes the model more helpful, polite, and safe, in line with human expectations. RLHF was a key step in turning raw language models into useful assistants.
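The reward-model step can be illustrated with a small sketch. It assumes a pairwise (Bradley-Terry style) preference loss, which is a common choice but is not specified in this entry, and it stands in a tiny linear model over two made-up features for a real neural reward model. The names `reward`, `preference_loss`, `train_reward_model`, and the feature values are all hypothetical. The later reinforcement learning step is not shown.

```python
import math

def reward(weights, features):
    """Score a response: a linear stand-in for a learned reward model."""
    return sum(w * x for w, x in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    """Pairwise loss: small when the chosen response outscores the rejected one."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit the weights with plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is
            # -(1 - sigmoid(margin)) * (chosen - rejected)
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical human preferences as (chosen, rejected) feature pairs.
# Assumed features: [helpfulness cue, rudeness cue]
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
    ([0.9, 0.2], [0.2, 0.6]),
]

weights = train_reward_model(pairs, dim=2)

for chosen, rejected in pairs:
    print(round(reward(weights, chosen), 2), ">", round(reward(weights, rejected), 2))
```

After training, the model assigns a higher score to each response the raters preferred. In a full RLHF pipeline, scores like these would serve as the reward signal when the language model is optimized.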
