Aligning LLMs, Part 1: RLHF

Fine-tuning and alignment are often misunderstood terms when it comes to Large Language Models (LLMs). In this series on aligning LLMs, we will cover the most popular fine-tuning methods for alignment, as well as emerging techniques, namely:

  1. Reinforcement Learning from Human Feedback (RLHF)
  2. Reinforcement Learning from AI Feedback (RLAIF)
  3. Direct Preference Optimization (DPO)
  4. Reasoning with Reinforced Fine-Tuning (ReFT)

In our first event, we tackle RLHF, a topic often glossed over by most AI Engineers. By the end of the session, we aim to give you deep intuition for how models like InstructGPT and Llama 2 leveraged human feedback to align with us on what it means to be “helpful, honest, and harmless.”

We’ll cover where RLHF fits within the broader context of training LLMs. Generally, this process starts with unsupervised pre-training, followed by supervised fine-tuning, before RLHF is used for a final polish. RLHF breaks down into three simple steps, each covered in detail:

  1. The first step is to start with a pre-trained base model and fine-tune it to respond well to many types of instructions; in other words, to instruct-tune it to increase its helpfulness (see the sketch just after this list).
  2. The next step is to train a reward (a.k.a. preference) model to represent human preferences. This requires multiple responses for a diverse set of prompts, each rank-ordered by human labelers. In this event, we will demonstrate how to train a reward model with two choices per prompt, given as [chosen, rejected] prompt-response pairs.
  3. The third and final step is to fine-tune using Reinforcement Learning (RL). Note that we will map RL vocabulary like policy, action space, observation space, and reward function directly onto our LLM alignment problem during the event!
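To make step 1 concrete, here is a minimal sketch of instruction-tuning (supervised fine-tuning) using Hugging Face’s TRL library. The dataset name and its “text” column are placeholders, the hyperparameters are illustrative, and exact keyword arguments vary across TRL versions; this is a sketch, not the event’s exact code.

```python
# Minimal supervised fine-tuning (instruct-tuning) sketch with TRL's SFTTrainer.
# Dataset name and "text" column are placeholders; arguments vary by TRL version.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

base_model = "mistralai/Mistral-7B-v0.1"           # pre-trained base model
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# Any instruction dataset whose rows contain a formatted prompt + response string
train_dataset = load_dataset("your-org/your-instruction-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",   # column holding the formatted instruction + response
    max_seq_length=1024,
)
trainer.train()
```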

Finally, we’ll discuss the limitations of RLHF, which will motivate the continuation of our series on alignment!

We will begin our code demonstrations with a fine-tuned version of Mistral-7B-v0.1 called Zephyr-7B-Alpha, which has already been tuned for helpfulness. Then we’ll train a BERT-style reward model, distilroberta-base, for sequence classification using the Helpful and Harmless (hh-rlhf) dataset from Anthropic.
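As a rough preview, here is a minimal sketch of that reward-model training step, assuming TRL’s RewardTrainer, distilroberta-base, and Anthropic’s hh-rlhf dataset of (chosen, rejected) pairs. The preprocessing and hyperparameters are illustrative, and exact arguments vary across TRL versions; it is not the event’s exact code.

```python
# Minimal reward-model training sketch: a sequence-classification head learns to
# score "chosen" responses higher than "rejected" ones.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# Anthropic's Helpful-Harmless preference pairs: columns "chosen" and "rejected"
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

def tokenize_pairs(batch):
    # RewardTrainer expects tokenized chosen/rejected columns
    chosen = tokenizer(batch["chosen"], truncation=True, max_length=512)
    rejected = tokenizer(batch["rejected"], truncation=True, max_length=512)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pairs, batched=True, remove_columns=dataset.column_names)

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=RewardConfig(output_dir="reward-model", per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("reward-model")   # saved for use in the RL step below
```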

Finally, we’ll optimize generations from our Zephyr model using our reward model, leveraging the real-toxicity-prompts dataset from the Allen Institute for AI (AI2).
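Below is a minimal sketch of that final RL step, assuming the pre-1.0 TRL PPO API (PPOTrainer with a value-head model): the Zephyr policy generates continuations for real-toxicity-prompts, the trained reward model scores them, and PPOTrainer.step() performs the update. The reward-model path and generation settings are placeholders, not the event’s exact code.

```python
# Minimal PPO fine-tuning sketch using the pre-1.0 TRL PPO API.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

policy_name = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(policy_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)      # policy
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)  # frozen reference

# Reward model trained in the previous step ("reward-model" path is a placeholder)
reward_pipe = pipeline("text-classification", model="reward-model")

dataset = load_dataset("allenai/real-toxicity-prompts", split="train")

config = PPOConfig(batch_size=16, mini_batch_size=4, learning_rate=1.41e-5)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
device = ppo_trainer.accelerator.device

for batch in dataset.iter(batch_size=config.batch_size):
    prompts = [p["text"] for p in batch["prompt"]]
    query_tensors = [
        tokenizer(p, return_tensors="pt").input_ids.squeeze(0).to(device) for p in prompts
    ]

    # Policy (action): generate a continuation for each prompt (observation)
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=48,
        do_sample=True, pad_token_id=tokenizer.eos_token_id,
    )
    responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

    # Reward: score each prompt + continuation with the trained reward model
    scores = reward_pipe([p + r for p, r in zip(prompts, responses)], truncation=True)
    rewards = [torch.tensor(s["score"]) for s in scores]

    # PPO update against the frozen reference model (KL penalty keeps the policy close)
    ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Note how the RL vocabulary maps onto the alignment problem here: the LLM is the policy, the prompt is the observation, the generated tokens are the actions, the reward model supplies the reward signal, and the frozen reference model constrains each update via a KL penalty.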

We will perform all steps in a Google Colab notebook environment, and all code will be provided directly to attendees!

Join us live to learn:

  • The role that RLHF plays in aligning base LLMs toward being helpful and harmless.
  • How to choose reference (policy) and reward models, and the datasets used to train them.
  • How RL, through Proximal Policy Optimization (PPO), fine-tunes the initial LLM!


Speakers:

  • Dr. Greg Loughnane is the Co-Founder & CEO of AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Since 2021, he has built and led industry-leading Machine Learning education programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.
  • Chris Alexiuk is the Co-Founder & CTO at AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Previously, he’s held roles as a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.

Follow AI Makerspace on LinkedIn & YouTube to stay updated with workshops, new courses, and opportunities for corporate training.