Reinforcement Learning from Human Feedback
Abstract
Reinforcement learning from human feedback (RLHF) has become a crucial tool for building the latest machine learning systems at scale. The field has grown from the core methods of RLHF into today's broader suite of post-training techniques. In this book, we give a comprehensive introduction to the core methods of post-training for readers with some quantitative background, organized around the canonical RLHF recipe. The book starts with what RLHF does and why it was created, covering seminal technical milestones in its young history and a primer on the reinforcement learning context needed to understand the rest of the book. The core of the book details every optimization stage in using RLHF, from instruction tuning to training a reward model and finally to rejection sampling, reinforcement learning, on-policy distillation, and direct alignment algorithms. The book also discusses broader topics, such as the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. The book concludes with advanced topics -- understudied or emerging research questions in synthetic data, tool use, character training, and evaluation -- and open questions for the field. The book is released with a variety of companion resources, including a codebase, a library to compare model completions from within post-training stages, and an educational course, to be a one-stop shop for learning all foundational concepts for post-training language models.