RLHF (Reinforcement Learning from Human Feedback)

RLHF is the training technique that turned raw language models into the helpful chatbots people actually use. Humans rank model outputs; a reward model is trained to predict those rankings; the LLM is fine-tuned to produce outputs the reward model rates highly.

The practical result: models that are more helpful, less likely to produce harmful content, and better at following instructions. RLHF (and successors like DPO and Constitutional AI) is why ChatGPT-3.5 felt like a step-change vs. earlier models, even though the underlying capability wasn't dramatically different.

Operators don't need to understand the math. The relevant takeaway: the alignment and instruction-following quality you experience from a model is largely determined by post-training (RLHF and successors), not just pre-training scale. This is why two models with similar parameter counts can feel completely different to use.

Related

Get in touch