
RLHF (Reinforcement Learning from Human Feedback)

A training technique that aligns LLMs with human preferences using ranked outputs.

RLHF is the training technique that turned raw language models into the helpful chatbots people actually use. It works in three steps: humans rank model outputs; a reward model is trained to predict those rankings; the LLM is then fine-tuned with reinforcement learning to produce outputs the reward model rates highly.
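For readers who do want a feel for the reward-model step, here is a minimal sketch. It assumes a pairwise (Bradley-Terry style) loss, which is one common way to train on rankings, not necessarily what any particular lab uses. The function names are hypothetical, and rewards are plain numbers here; in a real system they would come from a neural network scoring each output.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn one human ranking (best first) into (preferred, rejected) pairs."""
    # combinations() preserves list order, so the first item of each pair
    # is always the one the human ranked higher.
    return list(combinations(ranked_outputs, 2))

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Compute -log(sigmoid(reward_preferred - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    output higher, and large when it gets the order wrong.
    """
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical example: a human ranked three outputs, best first.
pairs = ranking_to_pairs(["answer_a", "answer_b", "answer_c"])

# Hypothetical reward-model scores for those outputs.
scores = {"answer_a": 1.5, "answer_b": 0.2, "answer_c": -0.7}
total_loss = sum(
    pairwise_preference_loss(scores[good], scores[bad]) for good, bad in pairs
)
```

Training the reward model means adjusting it so this loss goes down across many human rankings. The fine-tuning step then uses the trained reward model's scores as the signal the LLM is optimized against.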

The practical result: models that are more helpful, less likely to produce harmful content, and better at following instructions. RLHF is why ChatGPT (built on GPT-3.5) felt like a step-change vs. earlier models, even though the underlying capability wasn't dramatically different. Later methods such as DPO and Constitutional AI pursue the same goal with different training machinery.

Operators don't need to understand the math. The relevant takeaway: the alignment and instruction-following quality you experience from a model is largely determined by post-training (RLHF and successors), not just pre-training scale. This is why two models with similar parameter counts can feel completely different to use.

